Pseudo Content Addressable Storage within S3

Designing pseudo-content addressable storage systems in S3

Today I Explained

While reviewing an AWS S3 bucket you may have come across files with names similar to 87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7. These are files, sometimes zip archives, with names composed of what appears to be a random set of numbers and letters. These files aren’t actually published into S3 with a random name, but rather are published with their checksum as their name.

A checksum is the result of a hashing algorithm run over the entire contents of a file. If the hashing algorithm is stable, meaning it produces the same result for the same input, then you’ll receive the same checksum every time you pass the same file to the algorithm. The above string was generated using echo "a" | sha256sum, but you are able to create these checksums from files as well.
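As a sketch of the same idea in Python (the input below is "a\n" because echo appends a trailing newline; the chunked helper is for artifacts too large to read in one go):

    import hashlib

    # Equivalent of echo "a" | sha256sum: echo appends a newline to its input.
    print(hashlib.sha256(b"a\n").hexdigest())
    # 87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7

    def sha256_of_file(path: str) -> str:
        # Hash the file in chunks so large artifacts need not fit in memory.
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()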

Why do this with S3? As discussed previously, it can make CloudFormation deployments more manageable, but there is another motivation: enabling consistency within a content delivery network (CDN) backed by S3.

The use of checksums within these kinds of delivery networks has a couple of interesting properties. The first is that it allows us to consistently discover where an artifact is accessible, regardless of the source. An artifact will always be available at ${sha256sum file} (or the result of an equivalent hashing algorithm).
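A rough sketch of this property using boto3, where the bucket name artifacts and the file node.tar are placeholders, and sha256_of_file is the helper from earlier:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def artifact_exists(bucket: str, path: str) -> bool:
        # The object key is derived entirely from the file's contents, so any
        # producer or consumer computes the same location independently.
        key = sha256_of_file(path)
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            return False

    print(artifact_exists("artifacts", "node.tar"))  # placeholder bucket and file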

The exact same discoverability is possible with a naming convention. However, by having files be known by the result of a hashing algorithm, the CDN shifts from being a curated interface, with names and hierarchies, to one favouring access to files by programmatic means.

References to these files in infrastructure as code shift from computed paths to baked-in values. This shifts the responsibility of accessing these artifacts from people to process. Although it is possible to determine the checksum and look up the artifact using it, doing so is a disjoint process. A friendly layout conveys meaning, like s3://artifacts/node/v1.2.3/linux/arm64/node.tar, while the machine approach is just s3://{resolve:ssm:/artifacts/name}/{sha256sum file}.
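To make the machine approach concrete, a minimal sketch of resolving such a reference with boto3, assuming /artifacts/name is an SSM parameter holding the bucket name (as in the example above) and the checksum is a baked-in value:

    import boto3

    ssm = boto3.client("ssm")
    s3 = boto3.client("s3")

    # Mirror {resolve:ssm:/artifacts/name}: look up the bucket name in SSM.
    bucket = ssm.get_parameter(Name="/artifacts/name")["Parameter"]["Value"]

    # The checksum is baked into the template or code, not computed at access time.
    checksum = "87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7"

    s3.download_file(Bucket=bucket, Key=checksum, Filename="artifact.zip")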

On the loss of diagnostics

One of the frustrations with this approach is that useful diagnostic information (e.g. the bucket, architecture, filetype, version) is lost in the shift to checksums. At a glance, immediate errors such as an incorrect architecture or wrong version are no longer evident from the path.

For some this might be preferable, as it pushes the responsibility for verifying an artifact onto tools developed to confirm this information. After all, it isn’t a strict requirement that a file annotated with arm64 is actually compatible with arm64.
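A minimal sketch of such a tool, which only confirms that a downloaded artifact matches the checksum it is named by (verifying properties like architecture would require inspecting the file itself):

    def verify_artifact(path: str, expected_checksum: str) -> None:
        # The object key is the expected checksum, so re-hashing the download
        # confirms it is the artifact the reference claimed it to be.
        actual = sha256_of_file(path)
        if actual != expected_checksum:
            raise ValueError(
                f"checksum mismatch: expected {expected_checksum}, got {actual}"
            )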

On combining semantics and checksums

A pattern that exists in some AWS Cloud Development Kit (CDK) deployments is to publish artifacts under a semantic path (s3://artifacts/node/v1.2.3/), but use checksums for the names of the files themselves (v1.2.3/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7.zip).
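Constructing such a key is a one-liner; the package and version names below are placeholders:

    def combined_key(package: str, version: str, path: str) -> str:
        # Semantic prefix for humans, checksum file name for machines, e.g.
        # node/v1.2.3/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7.zip
        return f"{package}/{version}/{sha256_of_file(path)}.zip"

    print(combined_key("node", "v1.2.3", "node.zip"))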

Although this can assist with providing some semantic information, it often leads to mismatched expectations, and ultimately a shift towards a fully semantic model.