ADR-004: Asset-Lock Checksum Facts and STAC Projection
| Status | Date | Implementation |
|---|---|---|
| Draft | 2026-06-04 | Not implemented yet. |
Context
The active asset-lock schema records asset identity, structured location, and the current object facts from ADR-003:
size_bytes
etag
last_modified
items enrich currently writes only size_bytes back to STAC as file:size and
deliberately removes any source file:checksum. The active lock therefore
writes no checksum facts back to STAC metadata.
The STAC File Info extension defines file:checksum as a lowercase hexadecimal
Multihash string, but the active asset-lock schema has no checksum column.
External checksum metadata is not uniform. S3 exposes checksum values alongside
a checksum type that distinguishes FULL_OBJECT from COMPOSITE, and S3 ETags
may or may not be MD5 digests depending on upload and encryption details. The
asset-lock checksum design therefore needs to distinguish portable file
checksums from store-specific validators.
STAC Items may already contain file:checksum. Copying that value into
assets.lock.parquet during metadata-only lock creation would make the lock
look more certain than it is. The package would be storing source metadata, not
a fact observed from the storage system that stacpkg will later validate
against.
ETags have a similar problem. They are useful storage validators, but S3 multipart ETags and weak HTTP ETags are not full-object file checksums that mean the same thing across stores. Package file digests and OCI descriptor digests describe package files, not external assets referenced by the asset-lock table.
Decision
Implement checksum support by making the asset lock the source of truth and projecting from there into STAC metadata.
Add nullable file_checksum to the asset-lock schema. The column belongs with
the other object facts:
size_bytes
file_checksum
etag
last_modified
file_checksum in assets.lock.parquet means a checksum fact that the lock can
use for validation against the current store. It does not mean an unverified
checksum string copied from input STAC metadata. STAC file:checksum is a
projection from the lock, not the primary package record.
file_checksum must be encoded as lowercase hexadecimal Multihash compatible
with the STAC File Info extension. Reuse the existing checksum helpers instead
of adding another checksum representation.
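As an illustration of the encoding only, the sketch below shows how a raw digest maps to a lowercase hexadecimal Multihash string. The function name `to_multihash_hex` and the prefix table are hypothetical and stand in for the existing checksum helpers, which remain the code to reuse. The prefixes are the multicodec function code (as an unsigned varint) followed by the digest length.

```python
# Multihash prefix = varint(function code) + digest length in bytes.
# md5 has code 0xd5, which needs two varint bytes (0xd5 0x01).
_MULTIHASH_PREFIX = {
    "md5": bytes([0xD5, 0x01, 0x10]),
    "sha1": bytes([0x11, 0x14]),
    "sha256": bytes([0x12, 0x20]),
    "sha512": bytes([0x13, 0x40]),
}


def to_multihash_hex(algorithm: str, digest: bytes) -> str:
    """Return the lowercase hex Multihash for a raw digest (illustrative helper)."""
    prefix = _MULTIHASH_PREFIX[algorithm]
    expected_length = prefix[-1]
    if len(digest) != expected_length:
        raise ValueError(
            f"{algorithm} digest must be {expected_length} bytes, got {len(digest)}"
        )
    # bytes.hex() always produces lowercase hexadecimal.
    return (prefix + digest).hex()
```

A SHA-256 value therefore always starts with `1220` and is 68 hexadecimal characters long.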
When stacpkg calculates a checksum from bytes, it uses SHA-256 by default.
SHA-256 is already representable as STAC Multihash, is supported by S3 checksum
metadata, and avoids relying on MD5 or SHA-1 for newly calculated checksums.
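A minimal sketch of byte-stream calculation, assuming the asset is available as a binary file-like object. The function name and the 1 MiB chunk size are illustrative choices, not part of this decision.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_multihash_from_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hash a stream in fixed-size chunks and return a SHA-256 Multihash in hex."""
    hasher = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        hasher.update(chunk)
    # 0x12 = sha2-256 function code, 0x20 = 32-byte digest length.
    return "1220" + hasher.hexdigest()


if __name__ == "__main__":
    print(sha256_multihash_from_stream(io.BytesIO(b"example asset bytes")))
```

Reading in chunks keeps memory use bounded regardless of asset size, which matters because the assets referenced by the lock can be large.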
Manifest-only or no-probe locking may fill size_bytes from STAC file:size,
but it must not copy STAC file:checksum into assets.lock.parquet.
Fill file_checksum only from evidence tied to the store location:
- provider metadata when it reports a supported full-object checksum;
- explicit ETag promotion when the caller selects that method and the ETag is compatible with a single-part MD5 checksum;
- explicit byte-stream calculation when the selected checksum method requires it;
- copy or relocation paths when they observe or derive a full-object checksum fact.
The first provider-metadata implementation supports S3-style full-object checksum fields that can be converted to STAC Multihash:
ChecksumSHA256
ChecksumSHA1
ChecksumSHA512
ChecksumMD5
x-amz-checksum-sha256
x-amz-checksum-sha1
x-amz-checksum-sha512
x-amz-checksum-md5
Accept those fields only when the companion checksum type says FULL_OBJECT,
using either ChecksumType or x-amz-checksum-type. Ignore COMPOSITE
checksums, CRC checksums, and XXHash checksums.
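The acceptance rule above can be sketched as a single normalization function. This is a sketch under stated assumptions: the metadata arrives as a flat mapping with either API-style or lowercase header-style keys, checksum values are base64-encoded as S3 reports them, and the preference order among algorithms is arbitrary. The function name is hypothetical.

```python
import base64
import binascii
from typing import Mapping, Optional

# (API field, header field, Multihash prefix). Order is an assumed preference.
_SUPPORTED_FIELDS = [
    ("ChecksumSHA256", "x-amz-checksum-sha256", bytes([0x12, 0x20])),
    ("ChecksumSHA512", "x-amz-checksum-sha512", bytes([0x13, 0x40])),
    ("ChecksumSHA1", "x-amz-checksum-sha1", bytes([0x11, 0x14])),
    ("ChecksumMD5", "x-amz-checksum-md5", bytes([0xD5, 0x01, 0x10])),
]


def checksum_from_provider_metadata(meta: Mapping[str, str]) -> Optional[str]:
    """Return a hex Multihash for a supported FULL_OBJECT checksum, else None."""
    checksum_type = meta.get("ChecksumType") or meta.get("x-amz-checksum-type")
    if checksum_type != "FULL_OBJECT":
        # COMPOSITE or missing checksum type: not a portable file checksum.
        return None
    for api_field, header_field, prefix in _SUPPORTED_FIELDS:
        value = meta.get(api_field) or meta.get(header_field)
        if not value:
            continue
        try:
            digest = base64.b64decode(value, validate=True)
        except (binascii.Error, ValueError):
            continue
        if len(digest) == prefix[-1]:
            return (prefix + digest).hex()
    # CRC and XXHash fields are never consulted, so they are ignored.
    return None
```

Returning `None` rather than raising keeps unsupported metadata from blocking lock creation; the row simply has no file_checksum.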
Add checksum strategies to lock-producing commands such as asset-lock derive
and build:
- metadata: use only supported full-object checksums reported by object metadata; this is the default checksum strategy when metadata probing is enabled;
- use-etag: promote only compatible single-part MD5 ETags to Multihash;
- calculate-if-needed: use provider metadata when available and stream the asset bytes only when no supported checksum is reported;
- calculate-always: stream the asset bytes and calculate the checksum even if metadata already contains a checksum.
Byte-stream calculation must be explicit so metadata-only workflows do not
unexpectedly read large assets. --no-probe-metadata remains unable to fill
file_checksum.
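One way to picture the strategy dispatch is sketched below. All names are hypothetical. Two readings are assumptions rather than settled behavior: use-etag consults only the promoted ETag, and disabling metadata probing leaves file_checksum empty under every strategy. The byte-stream calculation is passed in as a callable so it runs only when a strategy explicitly asks for it.

```python
from enum import Enum
from typing import Callable, Optional


class ChecksumStrategy(str, Enum):
    METADATA = "metadata"
    USE_ETAG = "use-etag"
    CALCULATE_IF_NEEDED = "calculate-if-needed"
    CALCULATE_ALWAYS = "calculate-always"


def resolve_file_checksum(
    strategy: ChecksumStrategy,
    *,
    probe_metadata: bool,
    metadata_checksum: Optional[str],
    etag_checksum: Optional[str],
    calculate: Callable[[], str],
) -> Optional[str]:
    """Pick the file_checksum value for one asset (illustrative only)."""
    if not probe_metadata:
        # Assumed reading: --no-probe-metadata never fills file_checksum.
        return None
    if strategy is ChecksumStrategy.METADATA:
        return metadata_checksum
    if strategy is ChecksumStrategy.USE_ETAG:
        return etag_checksum
    if strategy is ChecksumStrategy.CALCULATE_IF_NEEDED:
        # Stream bytes only when the store reported nothing usable.
        return metadata_checksum if metadata_checksum is not None else calculate()
    return calculate()
```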
Do not add a verification/import command for existing source STAC
file:checksum in this implementation. Default lock creation remains
store-evidence first, so existing source STAC checksums are not imported into
assets.lock.parquet.
Keep etag separate from file_checksum. Do not automatically treat storage
validators as file checksums that mean the same thing across stores.
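A sketch of the shape check behind explicit ETag promotion, with a hypothetical function name. The check is necessary but not sufficient: an ETag can look like 32 hexadecimal characters and still not be an MD5 digest, depending on upload and encryption details. That is why promotion stays opt-in and the etag column is left untouched.

```python
import re
from typing import Optional

_MD5_HEX = re.compile(r"[0-9a-fA-F]{32}")


def promote_etag_to_multihash(etag: str) -> Optional[str]:
    """Return an MD5 Multihash for a single-part-style ETag, else None."""
    value = etag.strip()
    if value.startswith("W/"):
        # Weak HTTP validators never describe exact bytes.
        return None
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    if not _MD5_HEX.fullmatch(value):
        # Rejects multipart ETags such as "<hex>-3" and anything non-hex.
        return None
    # d5 01 = md5 function code as a varint, 10 = 16-byte digest length.
    return "d50110" + value.lower()
```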
Relocation must preserve or refresh checksums deliberately. asset-lock
relocate --dry-run may carry file_checksum forward only when the destination
lock is a planned alternate view of the same bytes. Actual copy or relocation
paths should prefer a checksum observed or calculated for the destination
object. They may preserve a verified source checksum only when the copy path
guarantees byte-for-byte transfer and no destination checksum is available.
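The relocation rules can be read as a small decision function. The sketch below is illustrative; the function and parameter names are hypothetical, and the caller is assumed to know whether the source checksum was verified and whether the copy path guarantees byte-for-byte transfer.

```python
from typing import Optional


def checksum_after_relocation(
    *,
    dry_run: bool,
    same_bytes: bool,
    source_checksum: Optional[str],
    source_verified: bool,
    destination_checksum: Optional[str],
    byte_for_byte: bool,
) -> Optional[str]:
    """Decide which file_checksum, if any, the relocated lock row should carry."""
    if dry_run:
        # A dry run observes nothing new; carry the value forward only when the
        # destination lock is a planned alternate view of the same bytes.
        return source_checksum if same_bytes else None
    if destination_checksum is not None:
        # Evidence observed or calculated at the destination wins.
        return destination_checksum
    if source_verified and byte_for_byte:
        return source_checksum
    return None
```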
asset-lock validate should compare locked file_checksum values with current
store evidence when the backend reports a comparable checksum. A validation mode
that streams bytes may calculate and compare the checksum when metadata is not
enough.
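A minimal sketch of the comparison step, assuming both values are already lowercase hexadecimal Multihash strings. The status names and the function name are hypothetical. Two checksums count as comparable only when they use the same algorithm, which the sketch reads from the Multihash prefix.

```python
from typing import Optional

# Hex Multihash prefixes for the algorithms this ADR supports.
_PREFIX_TO_ALGORITHM = {
    "1220": "sha256",
    "1114": "sha1",
    "1340": "sha512",
    "d50110": "md5",
}


def _algorithm(multihash_hex: str) -> Optional[str]:
    for prefix, name in _PREFIX_TO_ALGORITHM.items():
        if multihash_hex.startswith(prefix):
            return name
    return None


def compare_checksums(locked: Optional[str], observed: Optional[str]) -> str:
    """Return "match", "mismatch", or "not-comparable" for one asset."""
    if locked is None or observed is None:
        return "not-comparable"
    locked_algorithm = _algorithm(locked)
    if locked_algorithm is None or locked_algorithm != _algorithm(observed):
        # Different or unknown algorithms say nothing about the bytes.
        return "not-comparable"
    return "match" if locked == observed else "mismatch"
```

A "not-comparable" result is where a byte-streaming validation mode could calculate a checksum with the locked algorithm and compare again.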
items enrich projects file_checksum back to STAC as file:checksum only
when the lock row contains a current store-derived checksum fact. If a lock row
has no file_checksum, enrichment must continue removing any stale source
file:checksum for that asset.
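The projection rule can be sketched for a single STAC asset dictionary. The function name is hypothetical, and leaving file:size untouched when the lock row has no size_bytes is an assumption, not something this ADR specifies.

```python
from typing import Any, Dict, Optional


def project_lock_facts(
    asset: Dict[str, Any],
    size_bytes: Optional[int],
    file_checksum: Optional[str],
) -> Dict[str, Any]:
    """Return a copy of the asset with lock facts projected into File Info fields."""
    enriched = dict(asset)
    if size_bytes is not None:
        enriched["file:size"] = size_bytes
    if file_checksum is not None:
        enriched["file:checksum"] = file_checksum
    else:
        # No store-derived fact in the lock: drop any stale source value.
        enriched.pop("file:checksum", None)
    return enriched
```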
Validation results are command output, not saved lock state. Results such as
valid and errors must not be stored in assets.lock.parquet.
Implementation Strategy
Add file_checksum as a nullable asset-lock fact column and keep etag
separate. Update schema metadata, fixtures, and column filtering so only
accepted asset-lock fields are preserved.
Add --checksum {metadata,use-etag,calculate-if-needed,calculate-always} to
lock-producing and validation commands. The default is metadata; byte
calculation uses SHA-256. --no-probe-metadata does not fill
file_checksum.
Normalize provider checksum metadata in one place. For the first implementation,
convert S3-style ChecksumSHA256, ChecksumSHA1, ChecksumSHA512,
ChecksumMD5, and matching x-amz-checksum-* headers only when the checksum
type is FULL_OBJECT. Ignore COMPOSITE, CRC, and XXHash checksums.
During relocation, carry file_checksum through dry-run mapping only when the
new lock describes the same bytes. Copying paths should prefer a destination
checksum when available, calculate one when requested, and preserve a verified
source checksum only when byte-for-byte copy semantics are guaranteed.
Validation compares locked file_checksum with current store evidence, or with
a calculated SHA-256 checksum when the selected strategy allows streaming bytes.
Validation mismatches remain command output and are never written back into
assets.lock.parquet.
items enrich maps size_bytes to file:size and file_checksum to
file:checksum. If the lock row has no file_checksum, enrichment continues
to remove stale source file:checksum.
Tests cover Multihash normalization, S3-style full-object metadata, rejected
composite and unsupported checksums, ETag promotion limits, metadata-only
behavior, SHA-256 calculation, validation mismatches, relocation propagation,
STAC enrichment, and the rule that source STAC file:checksum is not imported.
Alternatives Considered
- Copy source STAC file:checksum: Easy to preserve, but it can be stale or unrelated to the store location being locked.
- Treat all ETags as checksums: Convenient for some S3 objects, but wrong for multipart uploads, weak HTTP validators, and storage-specific validators.
- Always compute checksums by streaming bytes: Strong evidence, but too expensive for large assets and surprising for metadata-only package creation.
- Use package file digests for asset-lock checksums: Package file and OCI descriptor digests describe package files. Asset-lock checksums describe referenced assets and need their own evidence from the store.
- Store validation results in the lock: Makes a past validation look permanent. Validation results are observations at one point in time and should remain command output or diagnostics.
Consequences
Checksum support requires an asset-lock schema revision. Existing readers and
tests need to handle the old schema and the new schema deliberately, and fixture
generation must include the new nullable file_checksum column once the schema
is accepted.
The package checksum contract becomes asset-lock first. Workflows that require
byte-level reproducibility will create or receive an asset lock with
file_checksum; STAC file:checksum is emitted later by items enrich from
that lock value.
Lock creation and validation gain explicit checksum strategy choices. Cheap metadata-only workflows remain cheap, while stronger workflows can opt into ETag promotion or byte-stream calculation.
Calculated checksum workflows standardize on SHA-256. Provider metadata support starts with full-object S3-style MD5, SHA-1, SHA-256, and SHA-512 fields only; composite checksums and unsupported algorithms are ignored rather than stored as weaker facts.
Relocation paths must decide whether they are preserving a verified source checksum, observing a destination checksum, or calculating a destination checksum. Dry-run mapping cannot imply new byte evidence.
Existing source STAC file:checksum remains non-authoritative input metadata.
This implementation does not add a trusted import path for it.
Tests must cover Multihash normalization, S3-style provider metadata, ETag
promotion limits, metadata-only behavior, SHA-256 byte-calculation strategies,
validation mismatches, relocation propagation, and STAC enrichment from
file_checksum.