How Amazon S3 Works Under the Hood
A deep dive into how S3 actually stores 100 trillion objects — the metadata/data split, flat namespace, placement service, durability through erasure coding, versioning, multipart uploads, garbage collection, and consistency.
Why I Wanted to Understand This
S3 is one of those services you use constantly without thinking twice. Upload a file, get a URL, done. But once you start asking how it stores 100 trillion objects while guaranteeing eleven nines of durability, the engineering underneath gets genuinely interesting.
These are my notes from going deep on S3 internals — how the data layer works, why the flat namespace isn't actually flat, and how they make deletion, versioning, and consistency work at this scale.
The Single Most Important Decision: Split Metadata from Data
The foundational design decision in S3 — and all object storage — is that metadata and data are stored in completely separate systems.
This is borrowed directly from UNIX filesystems, which use inodes: filenames and attributes live in one place, raw bytes live in another, with a pointer between them.
Client request: GET /company-docs/report.pdf
│
▼
API Service ──────▶ Metadata Store
│ "What's the UUID for
│ report.pdf in company-docs?"
│ ◀──── UUID: abc-123
▼
Data Store
"Give me bytes for abc-123"
│
▼
Returns bytes
The metadata store holds object name, bucket ID, size, timestamps, tags, and a UUID pointing to the actual data. Records are small (~1 KB), frequently updated, and need fast point lookups.
The data store holds raw bytes indexed only by UUID. It has no idea what anything is called — filenames, paths, and buckets never enter the picture. Data is immutable once written.
Why separate them? They have completely different characteristics, so they need different storage engines, scaling strategies, and consistency models. Keeping them separate lets each be optimized independently.
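To make the split concrete, here's a minimal sketch of the read path with both stores faked as Python dicts. The structure is illustrative; the real stores are distributed systems, not hash maps.

```python
# Two independent "systems", modeled as dicts for illustration only.
metadata_store = {}  # (bucket, key) -> {"uuid": ..., "size": ...}
data_store = {}      # uuid -> raw bytes; names never appear here

def get_object(bucket: str, key: str) -> bytes:
    # Step 1: fast point lookup in the metadata store.
    record = metadata_store[(bucket, key)]  # "what's the UUID for this name?"
    # Step 2: fetch the immutable bytes, addressed by UUID alone.
    return data_store[record["uuid"]]       # "give me bytes for abc-123"
```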
The Flat Namespace Isn't Really Flat
When you upload photos/2024/vacation.jpg, there's no photos directory and no 2024 subdirectory anywhere in the system. The entire string — slashes included — is just the object key.
The "folder" you see in the S3 console is an illusion maintained by prefix scanning. When you list a "folder," you're running:
"Give me all keys starting with
photos/2024/"
The metadata store keeps keys in sorted order (range-partitioned). All keys sharing that prefix sit adjacent on the same shard or neighboring shards, so the scan is efficient regardless of how many keys are in the bucket.
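Here's a toy version of that prefix scan, assuming the whole key range fits in one sorted list (in reality it's spread across range-partitioned shards):

```python
import bisect

# A bucket's keys, kept in sorted order.
keys = [
    "photos/2023/beach.jpg",
    "photos/2024/skiing.jpg",
    "photos/2024/vacation.jpg",
    "videos/2024/clip.mp4",
]

def list_prefix(prefix: str) -> list[str]:
    # Binary-search to the first key >= prefix, then scan forward while
    # keys still match. Cost scales with the number of matches, not with
    # the total number of keys in the bucket.
    start = bisect.bisect_left(keys, prefix)
    result = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result

print(list_prefix("photos/2024/"))
# ['photos/2024/skiing.jpg', 'photos/2024/vacation.jpg']
```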
This beats a real hierarchical filesystem because:
- No directory inodes to lock when modifying contents
- No recursive operations for listing or deleting
- A billion keys with the same prefix is just a bigger sorted range, not a deeper tree
The trade-off: "renaming a folder" means updating every key with that prefix individually. There's no atomic directory rename. Empty folders don't exist — a folder only exists as long as objects with that prefix do.
UUIDs: The Bridge Between Names and Bytes
The UUID is what connects a human-readable object name to physical bytes on disk. Here's the full upload sequence for PUT /company-docs/report.pdf:
- API service receives the request and checks permissions.
- Bytes stream to the data store. The data store generates a UUID, writes the bytes, returns the UUID.
- API service writes a metadata record:
  bucket: company-docs
  key: report.pdf
  uuid: 7f3a8b2c-9d4e-4a1f-bc56-...
  size: 2,148,393 bytes
  etag: <hash>
- API service returns 200 OK to the client.
Two things worth noting here. First, the data is written before the metadata record exists. If the system crashed between steps 2 and 3, you'd have orphaned bytes in the data store with no metadata pointing to them. That's fine; garbage collection handles it. The opposite order would be much worse: a metadata record pointing at bytes that don't exist. Write data first, commit the pointer last.
Second, UUIDs can be generated on any node without coordination — no central counter, no contention. Every version of an object also gets its own UUID, so versions are fully independent objects in the data store.
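A sketch of that write ordering, using the same dict-backed stores as before (the shapes are hypothetical):

```python
import uuid

def put_object(bucket, key, data, metadata_store, data_store):
    # UUIDs are generated locally: no central counter, no coordination.
    blob_id = str(uuid.uuid4())

    # Step 1: write the bytes. A crash after this line leaves orphaned
    # bytes that nothing points to; garbage collection reclaims them.
    data_store[blob_id] = data

    # Step 2: commit the pointer. Only now does the object "exist".
    # The reverse order could leave a pointer to bytes that were never
    # written, which is a much worse failure mode.
    metadata_store[(bucket, key)] = {"uuid": blob_id, "size": len(data)}
    return blob_id
```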
The Placement Service: Where Does Data Go?
The placement service is the brain of the data store. It decides where every byte gets written.
It maintains a virtual cluster map — a registry of every data node, which rack it's in, which availability zone, how many disks, and how much free space. Every data node sends a heartbeat every few seconds. No heartbeat for ~15 seconds means the node is marked down.
Because losing the placement service would make the entire data store unavailable for writes, it runs as a 5- or 7-node cluster using Raft or Paxos consensus:
| Cluster Size | Failures Tolerated |
|---|---|
| 5 nodes | 2 |
| 7 nodes | 3 |
When a write comes in, the placement service picks destination nodes by spreading across different racks and availability zones — so a single rack failure can't take out all copies of the data.
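A sketch of what that selection might look like, spreading copies across racks. A real placement service would also weigh zone spread and load; the names here are made up.

```python
from dataclasses import dataclass
import random

@dataclass
class DataNode:
    node_id: str
    rack: str
    zone: str
    free_bytes: int

def pick_targets(cluster_map: list[DataNode], size: int,
                 copies: int = 3) -> list[DataNode]:
    # Consider only nodes with room, in random order for load spreading.
    candidates = [n for n in cluster_map if n.free_bytes >= size]
    random.shuffle(candidates)

    chosen, used_racks = [], set()
    for node in candidates:
        if node.rack in used_racks:
            continue  # one rack failure must not take out two copies
        chosen.append(node)
        used_racks.add(node.rack)
        if len(chosen) == copies:
            return chosen
    raise RuntimeError("not enough independent racks with free space")
```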
Data Nodes: How Bytes Actually Hit Disk
A naive design stores one file per object. This falls apart at scale because spinning hard drives only handle ~100–150 random IOPS, and filesystems slow down with millions of files in a directory.
The fix: pack many objects into a single large extent — an append-only file typically capped at 1 GB.
Extent file (1 GB):
┌──────────────────────────────────────────────────────────┐
│ [len][object A bytes][checksum] [len][object B bytes]... │
└──────────────────────────────────────────────────────────┘
↑ ↑
UUID a → here UUID b → here
Each data node keeps a local index: UUID → (extent_id, offset, length). When the active extent reaches the size threshold, it's sealed (made read-only) and a new one opens.
The performance gain: many small random writes become one large sequential write. Sequential I/O on a spinning disk is ~100x faster than random I/O. Sealed extents are also immutable, which makes replication and erasure coding significantly simpler.
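A minimal sketch of a data node's write path, matching the [len][bytes][checksum] framing in the diagram. The exact on-disk format and class shape are assumptions, and the target directory must already exist.

```python
import struct
import zlib

EXTENT_CAP = 1 << 30  # seal extents at 1 GB

class ExtentWriter:
    def __init__(self, dir_path: str):
        self.dir_path = dir_path
        self.index = {}      # uuid -> (extent_id, offset, length)
        self.extent_id = 0

    def append(self, blob_id: str, data: bytes) -> None:
        # Frame the object as [len][bytes][checksum], as in the diagram.
        record = (struct.pack("<I", len(data))
                  + data
                  + struct.pack("<I", zlib.crc32(data)))
        path = f"{self.dir_path}/extent_{self.extent_id}"
        with open(path, "ab") as f:
            offset = f.tell()
            if offset + len(record) > EXTENT_CAP:
                self.extent_id += 1  # seal: old extent becomes read-only
                return self.append(blob_id, data)
            f.write(record)          # one sequential append, not a new file
        self.index[blob_id] = (self.extent_id, offset, len(record))
```

Many small random writes land as large sequential appends to one open file, and the UUID index is all a node needs to serve reads.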
Durability: Two Approaches, One Promise
S3 Standard targets eleven nines of durability (99.999999999%). At scale, hardware fails constantly — disks die daily, racks lose power. Durability comes from spreading copies across independent failure domains.
Replication
Store N full copies (usually 3) on different nodes, racks, or zones.
Pros: Simple, fast reads from any replica, easy recovery.
Cons: 3x storage cost — you pay for 3 PB to store 1 PB.
Erasure Coding (Reed-Solomon)
Split data into k data shards + m parity shards. Tolerate any m failures without data loss.
| Config | Total Shards | Storage Overhead | Failures Tolerated |
|---|---|---|---|
| k=10, m=4 | 14 | 1.4x | 4 |
| k=6, m=3 | 9 | 1.5x | 3 |
Compare this to 3x replication: k=10, m=4 erasure coding tolerates twice as many failures (4 vs. 2) at less than half the storage cost. At petabyte scale, this is an enormous saving.
The catch: reconstruction requires reading k shards and running the decoding math, which adds latency and CPU overhead. Updates are also expensive since parity must be recomputed — this is fine because objects are immutable anyway.
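To make the shard-plus-parity idea concrete, here's a toy with m=1: a single XOR parity shard, RAID-5 style. Real Reed-Solomon generalizes this to m parity shards using Galois-field arithmetic.

```python
def split_shards(data: bytes, k: int) -> list[bytes]:
    # Pad so the data divides evenly into k equal shards.
    shard_len = -(-len(data) // k)  # ceiling division
    data = data.ljust(shard_len * k, b"\0")
    return [data[i * shard_len:(i + 1) * shard_len] for i in range(k)]

def xor_parity(shards: list[bytes]) -> bytes:
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return bytes(parity)

k = 4
shards = split_shards(b"hello object storage", k)
parity = xor_parity(shards)        # store k data shards + 1 parity shard

shards[2] = None                   # lose any one shard...
survivors = [s for s in shards if s is not None]
shards[2] = xor_parity(survivors + [parity])  # ...and rebuild it from the rest
```

The reconstruction cost is visible even in the toy: rebuilding one shard means reading all the survivors.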
S3 reportedly uses erasure coding for the bulk of stored data across availability zones, with replication reserved for very hot or very small objects where the EC overhead isn't worth it. A background scrubbing process continuously verifies checksums and triggers repairs before bit rot becomes a problem.
Versioning: How Old Copies Survive
When versioning is enabled, every PUT to an existing key creates a new version instead of overwriting. Each version gets its own UUID in the data store. The metadata layer tracks an ordered list:
bucket: company-docs
key: report.pdf
versions:
- version_id: v3 (latest), uuid: abc-123, size: 2.1MB
- version_id: v2, uuid: def-456, size: 2.0MB
- version_id: v1, uuid: ghi-789, size: 1.8MB
Deleting a versioned object doesn't delete anything. It adds a delete marker as a new version. Plain GETs return 404, but older versions are untouched. Remove the delete marker and the object is "undeleted."
To actually free storage, you delete a specific version ID. That removes the metadata entry for that UUID and marks the bytes for garbage collection. Lifecycle rules can automate expiring non-current versions after N days to prevent a frequently-written bucket from accumulating unbounded storage.
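A sketch of the version-list model, including the delete-marker behavior (the record shape is illustrative):

```python
versions = {}  # (bucket, key) -> list of version records, newest first

def put_version(bucket, key, blob_id):
    versions.setdefault((bucket, key), []).insert(
        0, {"uuid": blob_id, "delete_marker": False})

def delete_object(bucket, key):
    # A plain DELETE writes a marker as the newest version; no bytes change.
    versions.setdefault((bucket, key), []).insert(
        0, {"uuid": None, "delete_marker": True})

def get_latest(bucket, key):
    history = versions.get((bucket, key), [])
    if not history or history[0]["delete_marker"]:
        raise KeyError("404 Not Found")  # newest "version" is a delete marker
    return history[0]["uuid"]            # older versions stay fetchable by ID
```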
Multipart Uploads: Surviving Large Transfers
For multi-GB files, a single HTTP request is fragile — a connection drop at 99% means starting over. Multipart upload breaks the transfer into independent, resumable parts.
Step 1 — Initiate:
POST /bucket/key?uploads
→ Returns: upload_id
Step 2 — Upload parts in parallel:
PUT /bucket/key?partNumber=1&uploadId=... → ETag for part 1
PUT /bucket/key?partNumber=2&uploadId=... → ETag for part 2
PUT /bucket/key?partNumber=3&uploadId=... → ETag for part 3
Parts (5 MB to 5 GB each) can upload in any order, concurrently. Each returns an ETag the client must remember.
Step 3 — Complete:
POST /bucket/key?uploadId=...
Body: ordered list of (partNumber, ETag) pairs
Each part is initially stored as its own internal object with its own UUID. On completion, the metadata layer creates a manifest pointing to the parts in sequence. Reads stream from each part sequentially. Background compaction can later merge them into a single contiguous extent.
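A sketch of the manifest idea, with parts held in a dict-backed data store (all names hypothetical):

```python
def complete_multipart(parts: dict[int, str]) -> list[str]:
    # The manifest is just part UUIDs in partNumber order; no bytes move.
    return [parts[n] for n in sorted(parts)]

def read_object(manifest: list[str], data_store: dict) -> bytes:
    # A GET streams each part's bytes back-to-back.
    return b"".join(data_store[part_uuid] for part_uuid in manifest)
```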
If a client never calls complete, the parts sit orphaned consuming storage. Lifecycle rules typically abort incomplete multipart uploads after N days.
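Both cleanup policies from this section correspond to real S3 lifecycle rules. A boto3 call along these lines (bucket name and day counts are placeholders) configures both at once:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="company-docs",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {   # expire non-current versions 30 days after being superseded
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            },
            {   # abort multipart uploads never completed within 7 days
                "ID": "abort-stale-multipart",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
        ]
    },
)
```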
Garbage Collection: Reclaiming Immutable Storage
Because data is immutable and writes are append-only, you can't free space inline. A background process handles it.
Garbage is created by:
- Deleted objects (bytes still on disk after metadata removal)
- Old versions past their retention window
- Aborted multipart uploads
- Failed uploads where data was written but metadata was never committed
When an object is deleted, the metadata store writes a tombstone marking the UUID as dead. The data store still holds the bytes. A background GC process runs on a schedule:
- Scan for tombstones older than a grace period (~24 hours).
- Notify data nodes to release those extent regions.
- Compact extents that have become mostly garbage: read live objects, rewrite them to a new extent, delete the old one.
The grace period exists because tombstones must propagate to all replicas before deletion is final. Without it, a stale read from a slow replica could briefly return a deleted object.
Compaction reads and writes are sequential, so they're fast and don't compete meaningfully with user traffic. This is the same pattern used by RocksDB, Cassandra, and most LSM-tree databases.
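Here's a sketch of one GC cycle over that model, with extents as dicts of UUID to bytes. The threshold and record shapes are illustrative.

```python
import time

GRACE = 24 * 3600  # tombstones must reach every replica before bytes can go

def gc_pass(tombstones: list[dict], extents: list[dict]) -> list[dict]:
    # 1. Only act on tombstones past the grace period, so a slow replica
    #    can never serve an object whose bytes were already reclaimed.
    cutoff = time.time() - GRACE
    dead = {t["uuid"] for t in tombstones if t["created_at"] < cutoff}

    result = []
    for extent in extents:
        live = {u: b for u, b in extent.items() if u not in dead}
        if len(live) < len(extent) / 2:
            # Mostly garbage: sequentially rewrite the survivors into a
            # fresh extent and drop the old one (the compaction step).
            result.append(live)
        else:
            # Not worth compacting yet; dead bytes stay in place for now.
            result.append(extent)
    return result
```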
Consistency: Strong Since 2020
Before December 2020, S3 had eventual consistency for overwrites and deletes — you might read a stale version right after writing. It now provides strong read-after-write consistency for all operations.
After a successful PUT, any GET immediately returns the new object. After a DELETE, any GET returns 404. No stale reads.
The trick is that the metadata layer is the source of truth, and it uses consensus (Paxos/Raft variants) to make metadata writes strongly consistent:
Write path:
1. Write bytes to data store (eventually consistent, fine)
2. Commit metadata pointer via quorum (strongly consistent)
3. Acknowledge to client only after step 2
Read path:
1. Look up current UUID in metadata store (sees latest committed pointer)
2. Fetch bytes for that UUID from data store
3. Return to client
The bytes don't have to be everywhere immediately — the pointer does. Since the metadata pointer is strongly consistent, the user experience is strongly consistent even though replicas of the actual bytes catch up asynchronously.
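Here's a toy model of that split, with the metadata quorum reduced to a fixed majority of dicts. It's nothing like real Raft or Paxos, but it shows why acknowledging only after the pointer commit yields read-after-write:

```python
import uuid

class MetadataQuorum:
    """Toy consensus stand-in: writes and reads both go through a majority
    of replicas, so a committed pointer is always visible to readers."""
    def __init__(self, n: int = 5):
        self.replicas = [dict() for _ in range(n)]
        self.majority = n // 2 + 1

    def commit(self, name, blob_id):
        for replica in self.replicas[:self.majority]:
            replica[name] = blob_id     # committed once a majority has it

    def read(self, name):
        votes = (r.get(name) for r in self.replicas[:self.majority])
        return next(v for v in votes if v is not None)

data_store = {}   # byte replicas can lag; UUIDs make that harmless
meta = MetadataQuorum()

def put(name: str, data: bytes) -> str:
    blob_id = str(uuid.uuid4())
    data_store[blob_id] = data          # 1. bytes first
    meta.commit(name, blob_id)          # 2. pointer via quorum
    return "200 OK"                     # 3. ack only after the commit

def get(name: str) -> bytes:
    return data_store[meta.read(name)]  # latest pointer -> immutable bytes

put("company-docs/report.pdf", b"v2")
assert get("company-docs/report.pdf") == b"v2"  # no stale read possible
```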
Key Takeaways
Separate systems for different access patterns. Metadata and data have completely different sizes, mutability, and consistency needs. The split is what makes each layer optimizable.
Append-only extent files solve the IOPS problem. One file per object doesn't scale. Batching into sequential append-only files turns random I/O into sequential I/O.
Erasure coding is the economical durability answer. 1.4–1.5x storage overhead with 3–4 failure tolerance beats the 3x cost of simple replication at petabyte scale.
Immutability makes everything else simpler. Replication, erasure coding, compaction, and GC all become dramatically easier when objects can't be partially modified.
The metadata pointer is where consistency lives. The data bytes can propagate asynchronously; what matters is that the pointer to the current version is immediately visible to all readers.