DDIA Notes — Chapter 3: How Databases Store and Find Your Data
Working through Chapter 3 of Designing Data-Intensive Applications: how indexes actually work, from a dumb append-only log to hash indexes, SSTables, LSM-trees, B-trees, and the secondary indexes built on top of them.
Why I'm Writing This Down
I just finished Chapter 3 of Designing Data-Intensive Applications and almost nothing stuck the first time through. The chapter throws a lot at you fast — log-structured storage, hash indexes, SSTables, LSM-trees, B-trees, secondary indexes — and it's easy to lose the thread of why each idea exists.
These notes are my attempt to rebuild the chapter from the ground up in an order that actually makes sense to me. Each section sets up the next. I want to be able to come back in six months and re-load the whole mental model in fifteen minutes.
The Problem Indexes Are Solving
Imagine the world's dumbest database: a single file where every write just appends a line to the end.
SET user:42 = "Alice"
SET user:17 = "Bob"
SET user:42 = "Alice2" ← overwrote Alice
Writes are extremely fast — one sequential append. But reads are terrible. To find user:42, you have to scan the entire file. And since updates also append, the latest value is whichever entry sits further down, so you'd actually scan from the end backwards.
An index is a separate data structure that helps you find rows quickly without scanning everything. It's the index at the back of a textbook: the book is the main data, the index is a small helper that maps "topic → page 247."
The tradeoff that runs through the entire chapter:
Indexes speed up reads. They slow down writes, because every insert has to update the index too.
Every storage engine in this chapter is a different answer to "how do we make reads on a log-like structure fast without killing writes."
Hash Indexes: The First Idea
Keep an in-memory hash map: key → byte offset in the file.
In memory: On disk (the log):
{ offset 0: SET user:42 = "Alice"
"user:42": 45, offset 23: SET user:17 = "Bob"
"user:17": 23 offset 45: SET user:42 = "Alice2"
}
To read user:42, look up the hash map → offset 45 → seek directly to that spot on disk. One disk seek. Fast.
To write: append to the file, update the hash map entry.
This is the storage engine inside Bitcask (Riak). It's a great fit when you have a lot of writes and the set of distinct keys is small enough that the hash map fits in RAM.
The problems:
- The hash map must fit in RAM. One entry per unique key.
- Range queries are bad. Hash maps don't preserve key order, so "all users between 100 and 200" requires scanning everything.
- The log grows forever. Solution: split it into fixed-size segments and run compaction in the background — throw out keys that have been overwritten, then merge segments together.
The "compaction" pattern shows up everywhere from here on. Hold onto it.
SSTables: The Key Insight
What if, within each segment file, we kept the entries sorted by key?
A segment that's sorted by key is called an SSTable (Sorted String Table). It sounds like a small change, but it unlocks a chain of benefits:
Benefit A — Merging sorted files is cheap. Two sorted segments merge like the merge step of mergesort: walk through both in parallel, write the smaller key, advance.
Benefit B — You don't need every key in your in-memory index. Because keys are sorted, if you know user:10 sits at offset 100 and user:50 sits at offset 500, then user:30 must be somewhere in between. You only need a sparse index — maybe one entry per few KB of data. Now your index fits in RAM even for huge datasets.
Benefit C — Blocks can be compressed. Group nearby keys into a block, compress the block, save disk space and I/O.
LSM-Trees: Putting It Together
But wait — how do you write to a sorted file? You can't just append; the new key might belong in the middle.
The trick: keep recent writes in an in-memory sorted structure called a memtable (usually a balanced tree like a red-black tree, where inserts are cheap and the structure stays sorted). When the memtable gets big enough, dump its contents to disk as a brand-new SSTable. Older SSTables get merged together in the background.
Writes
│
▼
┌───────────────┐ ┌──────────────────┐
│ Memtable │ │ Write-ahead log │
│ (in RAM, │────────▶│ (on disk, │
│ sorted) │ │ for crash │
└───────────────┘ │ recovery) │
│ └──────────────────┘
flush when full
│
▼
┌─────────────────────────────────────────┐
│ SSTable 1 SSTable 2 SSTable 3 ... │ (on disk, immutable)
└─────────────────────────────────────────┘
│
background compaction
merges them together
The full picture:
- Write: add to in-memory memtable. Also append to a write-ahead log on disk so a crash doesn't lose the memtable.
- Read: check the memtable first. If not there, check the newest SSTable, then the next-newest, and so on.
- Background: flush the memtable to a new SSTable when it fills up; periodically compact and merge SSTables.
This architecture is called an LSM-tree (Log-Structured Merge-tree). It powers LevelDB, RocksDB, Cassandra, HBase, and the storage layer of plenty of other systems.
The "log-structured" part means: writes always go sequentially to disk. No in-place updates ever. Even deletes are just markers called tombstones that say "this key is dead" — the actual bytes go away during compaction.
One concern: reads might have to check several SSTables before finding a key. To avoid pointless disk reads, each SSTable has a Bloom filter — a small probabilistic data structure that can quickly tell you "this key is definitely not in this SSTable" or "it might be." If the Bloom filter says no, you skip that SSTable entirely.
B-Trees: The Other Approach
B-trees are the other major family of indexes, and they're what most traditional databases use — PostgreSQL, MySQL/InnoDB, Oracle, SQL Server.
B-trees take the opposite philosophy from LSM-trees: instead of appending and merging, they update data in place.
The data is broken into fixed-size pages (typically 4 KB, matching the OS page size). Each page contains sorted keys. Pages reference other pages, forming a tree:
┌─────────────────────────────────┐
│ root page: <100 | <500 | <900 │
└───┬─────────┬─────────┬─────────┘
│ │ │
┌───▼──┐ ┌───▼──┐ ┌───▼──┐
│ leaf │ │ leaf │ │ leaf │ (leaves hold actual
│ page │ │ page │ │ page │ keys + values, or
└──────┘ └──────┘ └──────┘ pointers to values)
To find a key: start at the root, see which range it falls in, follow the pointer down, repeat. The branching factor (how many children per page) is large — often several hundred — so even huge databases are only 3 or 4 levels deep. Three or four disk seeks to find anything.
To insert: find the right leaf page and write the new key into it. If the page is full, split it into two pages and update the parent. The tree stays balanced.
The dangerous moment: if you split a page and the process crashes before the parent is updated, the tree is corrupted. So B-trees use a write-ahead log (WAL) — every change is logged to an append-only file before it's applied to the tree. On recovery, replay the log.
Notice the philosophical opposite of LSM-trees: B-trees overwrite pages directly. The same key always lives in the same place. There's exactly one copy of each key in the index at any moment.
LSM-Trees vs B-Trees
The comparison everyone wants. Rough rules of thumb:
| LSM-Trees | B-Trees | |
|---|---|---|
| Writes | Faster — sequential appends only | Slower — random page writes, plus WAL |
| Reads | Slower in the worst case (may check several SSTables) | Faster and more predictable (one path through the tree) |
| Disk space | Better — tightly packed, good compression | Worse — pages are often half-empty after splits |
| Performance predictability | Less predictable — compaction can cause latency spikes | More predictable |
| Maturity | Common in newer systems (Cassandra, RocksDB, ScyllaDB) | Default in mature relational DBs (Postgres, MySQL, Oracle) |
Neither is universally better. LSM-trees win for write-heavy workloads where you can tolerate occasional compaction-induced latency spikes. B-trees win when you need consistent, predictable read and write latency, and they're battle-tested.
One subtlety the book pushes hard: both systems have write amplification — one logical write turns into multiple physical writes. B-trees write the page plus the WAL entry. LSM-trees rewrite data multiple times as it gets compacted through levels. Which one amplifies more depends entirely on the workload.
Secondary Indexes
Everything above is about the primary index: looking up rows by primary key. A secondary index is any other index — for example, "find all users where email = 'x@y.com'."
The main complication: secondary index values aren't unique. Many rows can share the same email-domain or the same city. So an index entry has to point to a list of matching rows.
Both LSM-trees and B-trees can implement secondary indexes — the data structure is the same, the entries just point to row locations instead of containing the row data itself.
A few flavors worth knowing:
- Heap file approach. The index stores row IDs; the actual rows live separately in a "heap file." Updates that don't change the indexed column are cheap because the heap file row stays in the same place.
- Clustered index. The actual row data lives inside the index. MySQL's InnoDB uses this for the primary key — the primary key index is the table. Reads are fast, but secondary indexes have to do an extra lookup to get back to the primary key.
- Covering index. Stores some (not all) columns in the index itself, so common queries can be answered without touching the main table at all. A middle ground between heap and clustered.
Storing Everything in Memory
The chapter closes with a quick aside: if your dataset fits in RAM, you can drop most of the above complexity. In-memory databases like Redis and Memcached keep everything in memory and use disk only for durability (write-ahead log + periodic snapshots).
The surprising thing is that in-memory databases aren't faster mainly because RAM is faster than disk — modern OSes cache hot pages in RAM anyway. They're faster because they can use data structures that would be awkward on disk (e.g., Redis's sorted sets, priority queues, lists) without worrying about how to lay them out across fixed-size pages.
Key Takeaways
Every storage engine is a different point on the read-cost vs write-cost tradeoff. Pick the one that matches your workload, not the one with the coolest name.
LSM-trees turn random writes into sequential writes. The price is that reads may have to check multiple SSTables, and compaction work happens in the background.
B-trees update in place. The price is the write-ahead log to make page splits crash-safe, and the random I/O pattern for writes.
Sorted data is a superpower. It enables sparse indexes, cheap merges, range queries, and prefix scans. Both SSTables and B-tree leaf pages exploit sortedness; hash indexes can't.
Compaction is a recurring pattern. Any append-only or immutable storage system eventually needs a background process to reclaim space and merge fragmented data. LSM-trees, log-structured filesystems, and object storage systems like S3 all rely on it.
Secondary indexes are just more of the same. They use the same underlying data structures as primary indexes — they just point to rows instead of containing them.