DDIA Notes — Chapter 5: Replication

Chapter 5 is where DDIA stops being about a single machine and starts being about several machines pretending to agree. That shift is where my intuition fell apart — once there's more than one copy of the data, almost everything I assumed about reads and writes stops being automatically true.

The chapter has a clean skeleton, and it helps to see it before the details:

There are exactly three ways to replicate data — single-leader, multi-leader, and leaderless. Everything else in the chapter is either (a) a knob on one of those, or (b) a problem that one of those creates.

Why Replicate At All?

Keeping a copy of the same data on multiple machines buys you three things:

Lower latency — put data geographically close to users.
Higher availability — the system keeps working even if some nodes go down.
Higher read throughput — many machines can serve reads in parallel.

If the data never changed, replication would be trivial: copy it once and you're done. The entire difficulty of this chapter is handling changes — making sure writes propagate to every copy without things getting out of sync.

Architecture 1: Single-Leader Replication

The most common model, and the one most relational databases use by default (PostgreSQL, MySQL, Oracle Data Guard, SQL Server). Also used by MongoDB, Kafka, and RabbitMQ in various forms.

The setup: one replica is designated the leader (a.k.a. master / primary). The rest are followers (slaves / secondaries / read replicas).

                  writes
                    │
                    ▼
              ┌───────────┐
              │  LEADER   │
              └─────┬─────┘
        replication │ stream
          ┌─────────┼─────────┐
          ▼         ▼         ▼
     ┌────────┐ ┌────────┐ ┌────────┐
     │FOLLOWER│ │FOLLOWER│ │FOLLOWER│
     └────────┘ └────────┘ └────────┘
          ▲         ▲         ▲
          └─────────┴─────────┘
                  reads

The rules:

All writes go to the leader. The leader writes to its own storage, then sends the change to every follower as part of a replication log.
Reads can go to the leader OR any follower. This is the win — reads scale out across followers.

That's the whole idea. The complications come from the questions it raises: how fast do followers get updated, what happens when a node dies, and what the replication log actually contains.

The Sync/Async Knob

When the leader gets a write, does it wait for followers to confirm before telling the client "done"?

Synchronous: leader waits for the follower to confirm it received the write before reporting success. The follower is guaranteed up to date. But if that follower is slow or down, the write blocks — the whole system stalls.
Asynchronous: leader sends the write and reports success immediately, without waiting. Fast and resilient to slow followers. But if the leader crashes before the write propagates, that write is lost.

In practice, fully synchronous replication is rare because one slow node freezes everything. The common compromise is semi-synchronous: one follower is synchronous, the rest are async. If the sync follower fails, an async one is promoted to sync. This guarantees at least two nodes have the data.

Most large-scale systems lean async, and accept that a leader crash can lose the last few writes. That's a real, conscious durability tradeoff — not an accident.

Setting Up a New Follower

You can't just copy the leader's files while writes are happening — the data is a moving target. The clean process:

Take a consistent snapshot of the leader at one point in time (without locking the whole DB).
Copy the snapshot to the new follower.
The follower connects to the leader and requests all changes since the snapshot. It identifies that exact point using a position in the replication log (Postgres calls it a log sequence number; MySQL uses binlog coordinates).
Once it's caught up, it keeps streaming changes live.

Handling Node Outages

This is where single-leader gets interesting, because the two failure cases are very different.

Follower fails — easy. The follower knows the last transaction it processed (from its local log). On restart, it asks the leader for everything since then and catches up. Called catch-up recovery. No big deal.

Leader fails — hard. This requires failover: promoting a follower to be the new leader. The steps are:

Detect the leader has failed (usually a timeout — no heartbeat for N seconds).
Choose a new leader (election, or a pre-designated controller picks one).
Reconfigure the system so writes go to the new leader.

Failover is full of sharp edges, and the book lists them deliberately:

Lost writes. With async replication, the new leader may not have received the most recent writes from the old leader. When the old leader rejoins, those writes are usually discarded — which can violate clients' durability expectations. The book's example: GitHub once had a case where a promoted follower had a stale MySQL autoincrement counter, which had already been used as keys in Redis, causing data to be handed to the wrong users.
Split brain. Two nodes both believe they're the leader and both accept writes. Without a mechanism to detect and shut one down, the data diverges and corrupts.
Timeout tuning. Too short → unnecessary failovers during a brief network blip (which makes things worse if the system is already loaded). Too long → longer downtime before recovery. There's no universally correct value.

The uncomfortable takeaway: there are no easy solutions, which is why some teams choose to do failover manually rather than automatically.

What's Actually In the Replication Log?

Four different ways to implement the stream of changes from leader to follower, each with tradeoffs:

Method	How	Main problem
Statement-based	Ship the literal SQL statements	Nondeterminism: `NOW()`, `RAND()`, autoincrement, side-effecting functions produce different results on each replica
Write-ahead log (WAL) shipping	Ship the low-level storage-engine log	Tightly coupled to the storage format — followers must run a near-identical engine version, which blocks zero-downtime upgrades
Logical (row-based) log	Ship a description of row changes, decoupled from the engine	More verbose, but decoupled — different versions and even different storage engines can interoperate (this is MySQL's row-based binlog)
Trigger-based	Application-level triggers fire on changes	Most flexible (selective replication, cross-system), but higher overhead and more bug-prone

The logical log is the sweet spot for most modern setups, and it's also what powers change data capture — letting external systems (search indexes, caches, data warehouses) subscribe to a database's changes.

The Big Problem: Replication Lag

With async replication, a follower is always slightly behind the leader. Usually that's milliseconds. Under load, it can be seconds or worse. This gap is replication lag, and it produces three specific, named anomalies that the book wants you to recognize. Each has its own fix.

Anomaly 1: Reading Your Own Writes

You post a comment, the write goes to the leader, then your page refresh reads from a follower that hasn't caught up yet — and your own comment is gone. Maddening.

The guarantee you want is read-after-write consistency: a user always sees their own updates. Fixes:

Read anything the user might have modified from the leader (e.g., always read your own profile from the leader, but other people's from followers).
Track the timestamp of the user's last write; for a short window after, only read from replicas that are at least that current.

Anomaly 2: Moving Backward in Time

You refresh twice. The first read hits an up-to-date follower and shows a new comment. The second read hits a more-lagged follower and the comment vanishes again. You've seen time go backward.

The guarantee you want is monotonic reads: you may read stale data, but you never read older data than you've already seen. The fix is simple — make each user always read from the same replica (e.g., chosen by a hash of their user ID), so they never bounce to a more-stale one.

Anomaly 3: Seeing Effects Before Causes

Mr. Poons asks a question; Mrs. Cake answers it. A third observer, reading from a sharded database, sees the answer before the question — because the two writes landed on different partitions with different lag. The conversation is nonsense.

The guarantee you want is consistent prefix reads: if writes happen in a certain order, anyone reading them sees them in that same order. This mainly bites in partitioned/sharded systems where different partitions replicate independently. The fix involves making sure causally-related writes go to the same partition, or tracking causal dependencies explicitly.

The meta-point: "eventual consistency" is a vague promise, and these three anomalies are what it actually feels like when it bites you.

Architecture 2: Multi-Leader Replication

What if you allow more than one leader, each accepting writes, and each acting as a follower to the others?

You'd rarely do this inside a single datacenter (the added complexity isn't worth it). The use cases where it earns its keep:

Multi-datacenter operation. A leader in each datacenter. Writes are processed locally (low latency) and replicated across datacenters asynchronously. A whole datacenter can go offline without halting writes.
Offline-capable clients. A calendar app on your phone is effectively a leader — it accepts writes while offline and syncs later. Every device is a one-node datacenter.
Collaborative editing. Google-Docs-style real-time editing is essentially the same problem.

The benefit is obvious: writes don't all funnel through one node. The cost is the thing that dominates the rest of the section: write conflicts.

The Conflict Problem

Two leaders accept a write to the same record at the same time. Now you have two divergent versions and no single authority to say which is right.

   Leader A                         Leader B
   sets title = "B"                 sets title = "C"
        │                                │
        └────────► conflict! ◄───────────┘
            both writes are valid,
            which one wins?

Approaches, roughly in order of preference:

Conflict avoidance. Route all writes for a given record to the same leader. No conflict can occur. The simplest and most common real-world strategy — until that leader fails and you have to reroute.
Converge to a consistent state. All replicas must end up agreeing. Options:
- Last write wins (LWW): attach a timestamp/ID, keep the highest. Simple, but silently discards data — one of the two real writes just vanishes.
- Give each write a unique ID, highest wins. Same data-loss problem as LWW.
- Merge the values (e.g., concatenate "B/C"). Preserves both but may produce garbage.
- Record the conflict and let application code (or the user) resolve it later.
Custom resolution logic. Run app-defined code either on write (as conflicts are detected) or on read (present all versions to the app/user and let them pick). CRDTs and operational transformation are the rigorous versions of this.

Topologies

How leaders pass writes to each other matters:

All-to-all (every leader to every other) — most common, most robust, but messages can arrive out of causal order.
Circular and star — fewer connections, but a single node failure can interrupt the flow, and they have ordering issues too.

The ordering problem (a write and its causally-dependent follow-up arriving in the wrong order at some node) is handled with version vectors, which is the same machinery used in leaderless systems below.

Architecture 3: Leaderless Replication

The third model throws out the leader entirely. Any replica can accept a write directly. This is the Dynamo-style approach, popularized by Amazon's Dynamo and used by Riak, Cassandra, and Voldemort.

The client (or a coordinator node acting on its behalf) sends each write to several replicas, and reads from several replicas too.

Quorums

The core formula. With n replicas, if every write is confirmed by w of them and every read queries r of them, then:

   w + r > n   ⟹   every read overlaps at least one node
                   that saw the latest write

A typical config is n=3, w=2, r=2 (2 + 2 > 3 ✓). The overlap is what guarantees a read sees the freshest value — as long as you can tell which of the returned values is freshest (via version numbers).

Why not require all replicas every time? Availability. With a quorum, a write still succeeds when one replica is down, and a read still succeeds when one replica is down. You trade the certainty of "all nodes agree" for the resilience of "enough nodes agree."

Keeping Replicas In Sync

Since there's no leader pushing a clean log, two background mechanisms repair divergence:

Read repair. When a client reads from several replicas and notices one returned a stale value, it writes the newer value back to the stale replica. Repairs happen as a side effect of normal reads.
Anti-entropy. A background process constantly compares replicas and copies missing data. Not tied to any particular read, so values rarely read still eventually converge.

The Catch With Quorums

w + r > n sounds airtight but the book is careful to list the leaks: with sloppy quorums, writes can land on nodes that aren't the "home" replicas during a network partition (then get forwarded back later, called hinted handoff), which boosts write availability but means a read quorum might not overlap the write quorum. Concurrent writes, writes that fail on some nodes, and various edge cases also weaken the guarantee. Quorums give you better odds, not absolute consistency.

Detecting Concurrent Writes

The hardest part. Two clients write to the same key concurrently, and replicas receive them in different orders. Who wins?

Last write wins (LWW) — pick one by timestamp, discard the other. Cassandra's default. Only safe when losing data is acceptable (e.g., immutable/append data); otherwise it's silent loss.
"Happens-before" and concurrency. Two operations are concurrent if neither knew about the other. If write B knew about write A (built on top of it), B can safely supersede A. If they're truly concurrent, that's a real conflict to resolve.
Version numbers / version vectors. The server tracks a version per replica and bundles them into a vector. By comparing version vectors, the system can tell whether two writes are concurrent (real conflict) or causally ordered (safe to overwrite). This is the rigorous foundation under both leaderless conflict detection and multi-leader topologies.

Key Takeaways

There are only three replication architectures. Single-leader (one writer, easy reasoning, failover pain), multi-leader (multiple writers, conflict pain), and leaderless (no writer, quorum + repair). Almost every real system is a variation on these.

Synchronous vs asynchronous is a durability-vs-availability dial. Async is fast and resilient to slow nodes but can lose recent writes on a crash. Most large systems pick async and live with the risk.

Replication lag has three named failure modes. Read-your-writes, monotonic reads, and consistent-prefix reads. "Eventual consistency" is just the absence of these guarantees.

Failover is genuinely dangerous. Lost writes, split brain, and timeout tuning have no clean answers — which is why automatic failover isn't always the right call.

Conflicts are the price of accepting writes in more than one place. Multi-leader and leaderless both have to detect and resolve them; the cleanest strategy is to avoid them by routing related writes to one place.

Quorums (w + r > n) give better consistency odds, not certainty. Sloppy quorums, concurrent writes, and partial failures all poke holes in the guarantee.

Version vectors are the real tool for reasoning about "who knew what." They let a system distinguish a genuine concurrent conflict from a safe causal overwrite — the foundation under both multi-leader and leaderless conflict handling.