Skip to main content
avatarJay Patel

DDIA Notes — Chapter 4: Encoding and Evolution

Working through Chapter 4 of Designing Data-Intensive Applications: why we need encoding formats at all, how JSON/Protobuf/Avro/Thrift differ, what backward and forward compatibility actually mean, and how data flows between processes via databases, RPC, and message queues.

Posted May 26, 202611 min readBackend, Distributed Systems

The Problem: Data Lives in Two Worlds

Inside a running program, data lives as in-memory objects: structs, hash maps, lists, classes. The CPU can follow pointers, the layout is whatever the language wants it to be.

But the moment data needs to leave that process — written to disk, sent over a network, dropped into a message queue — it has to become a sequence of bytes. A flat blob. No pointers, no language-specific objects, just bytes.

Going from in-memory → bytes is called encoding (also: serialization, marshalling). Going the other way is decoding (deserialization, unmarshalling). Every time data crosses a process boundary, both happen.

   ┌─────────────────┐                          ┌─────────────────┐
   │  Process A      │                          │  Process B      │
   │                 │                          │                 │
   │  User object    │  encode    bytes  decode │  User object    │
   │  { id: 42,      │ ────────▶ ░░░░░░ ────────▶ { id: 42,       │
   │    name: "Jay"} │           ░░░░░░         │    name: "Jay"} │
   │                 │                          │                 │
   └─────────────────┘                          └─────────────────┘

This sounds boring until you remember: process A and process B might be different versions of the same service, written months apart, deployed at different times. Encoding is where software evolution lives or dies.

Why Language-Specific Encodings Are a Trap

Every language has a built-in way to do this — Java has Serializable, Python has pickle, Ruby has Marshal, etc. They're tempting because they're one line of code.

Don't use them across process boundaries. Three big reasons:

  • They lock you into one language. A Python service can't decode a Java object.
  • They're security nightmares. Most of these formats can instantiate arbitrary classes during decoding, which means a malicious payload can execute code. Real CVEs have happened in every major language's built-in serializer.
  • They're slow and bloated. They weren't designed for cross-network use.

So the chapter quickly moves past these to the formats actually used in practice.

Textual Formats: JSON, XML, CSV

These are the formats you've probably been using your whole career. They're human-readable and language-agnostic, which is why they're everywhere.

But they have real problems the book is careful to point out:

  • Ambiguous number handling. JSON doesn't distinguish integers from floats. Numbers larger than 2^53 lose precision in JavaScript. Twitter famously had to return tweet IDs as both a number and a string because JS clients couldn't handle the integer.
  • No binary support. Want to send an image? Base64-encode it, which inflates the size by ~33%.
  • No schema. You and I have to agree out-of-band on what fields exist and what their types are. If we disagree, things break at runtime.
  • Verbose. Every record repeats every field name as a string.

These formats are fine for many things — public APIs, config files, debugging. But at high volume, the verbosity and lack of schema start to hurt.

Binary Formats with Schemas: Thrift and Protocol Buffers

Both Thrift (from Facebook) and Protocol Buffers (from Google) work the same way conceptually. You define a schema:

message Person {
  required string name = 1;
  optional int32 age = 2;
  repeated string hobbies = 3;
}

Then a code generator produces classes in your language. To encode, you call .serialize(); to decode, you call .parse(bytes).

The encoded form is tiny. Field names disappear entirely — they're replaced by the small integer tag numbers (the = 1, = 2, = 3 above). The schema is what tells the decoder "tag 1 is a string called name."

JSON:    {"name":"Jay","age":27}            →  23 bytes
Protobuf encoding (roughly):
  tag=1, type=string, length=3, "Jay"
  tag=2, type=varint, 27                    →  ~9 bytes

This is fast, compact, and unambiguous about types. The cost: you can't read the bytes by eye, and producer and consumer must both have the schema.

Why Tag Numbers Are the Whole Trick

Here's the part the chapter wants you to internalize. The tag numbers are what make these formats evolvable.

  • Adding a new field? Pick a new tag number. Old code reading new data sees an unknown tag and skips it. New code reading old data sees the field is missing and uses a default. → Forward and backward compatible.
  • Renaming a field? Totally fine — the name is just for humans. The tag number is what's encoded.
  • Removing a field? Fine, as long as you never reuse that tag number for something different later.
  • Changing a type? Risky. Some changes are safe (int32 → int64 in most cases), most aren't.

You must never change the tag number of an existing field. That's the cardinal rule.

One Protobuf-specific gotcha: in older versions, fields were required or optional. Marking a field required is a one-way door — if you ever want to remove it later, you can't, because old readers will refuse messages that lack it. Modern Protobuf removed required entirely for this reason.

Avro: A Different Take

Avro looks similar — schema-based binary encoding — but it makes a clever choice that's worth understanding because it's different from Thrift/Protobuf.

Avro encoded data has no tag numbers and no type information in the bytes. It's literally just the values, concatenated. To decode, you must have the schema, because there's no way to know where one field ends and the next begins without it.

Person record encoded in Avro (roughly):
  3 (length of "Jay"), "Jay", 27          ← that's it. No tags, no field names.

This works only if both the writer and the reader have the schema. So Avro distinguishes two:

  • Writer's schema — what the producer used when encoding.
  • Reader's schema — what the consumer expects.

They don't have to match. Avro's library resolves differences between them at decode time: if the reader's schema has a field the writer didn't, fill in a default; if the writer had a field the reader doesn't, skip it; if the order differs, match by name.

This sounds fancy but it has a practical consequence: you have to ship the writer's schema along with the data, or make it discoverable. Common patterns:

  • For a big file with millions of records, store the schema once at the top of the file.
  • For a stream of messages, register the schema in a schema registry (Confluent's Kafka tooling popularized this) and embed just a version ID with each message.
  • For RPC, the two sides exchange schemas during the connection handshake.

The payoff of this design: schemas can be generated from a database table, evolved dynamically, and don't need a code-generation step. Avro is especially popular in the Hadoop/Kafka ecosystem for this reason.

Backward and Forward Compatibility

This is the vocabulary the chapter keeps using, and it's worth slowing down on because they're easy to confuse.

  • Backward compatibility: newer code can read what older code wrote. The new version still understands the old format.
  • Forward compatibility: older code can read what newer code wrote. The old version doesn't break when it sees a newer format — it ignores what it doesn't understand.
                 reads data written by
                 ┌──────────────────────────┐
                 │            OLD     NEW   │
                 │  OLD       ✓       ?     │ ← forward compatibility
   code version  │                          │
                 │  NEW       ?       ✓     │ ← backward compatibility
                 └──────────────────────────┘

Backward compatibility is the easier one — when you write new code, you know what the old format looked like. Forward compatibility is harder because the old code has to gracefully handle data it has never seen before. The "skip unknown fields" behavior of Protobuf/Thrift/Avro is what makes forward compatibility possible.

Why does this matter so much in practice? Because in any non-trivial deployment, old and new versions of code run at the same time:

  • Rolling deployments. You upgrade servers one at a time; for a while, half are old and half are new, all hitting the same database.
  • Mobile apps. Users don't all upgrade simultaneously. The server has to handle requests from app versions going back months.
  • Microservices. Service A might be on v5 while service B is still on v3.

Encoding format choices either help you here or fight you.

How Data Actually Flows Between Processes

Once you understand encoding, the rest of the chapter is about the three main ways encoded data flows between processes. Each has its own evolution quirks.

Flow 1: Through Databases

The process that writes the data and the process that reads it may be the same service, separated by time. You wrote a record last year; today, a newer version of the code reads it back.

This means a database needs both backward compatibility (new code reads old records) and forward compatibility (old code, still running on some servers during a rolling deploy, reads records written by new code without breaking).

A subtle pitfall the book calls out: when old code reads a record with a new field it doesn't understand, then writes that record back, it must preserve the unknown field rather than dropping it. Otherwise the new field silently disappears when an old service touches the record. Good libraries handle this; ad-hoc code often doesn't.

Migrations are also worth mentioning here. Adding a column is usually easy — the database fills NULL for existing rows. Changing a column's meaning or type is hard and usually requires a multi-step migration: add the new column, dual-write, backfill, switch reads, drop the old column.

Flow 2: Through Service Calls (REST and RPC)

When service A calls service B over the network, both encoding and a calling convention are involved.

REST is a style, not a protocol — typically HTTP + JSON, organized around resources and verbs (GET, POST, etc.). Easy to debug, language-agnostic, no code generation required.

RPC (Remote Procedure Call) tries to make a network call look like a local function call. The old idea (CORBA, Java RMI) failed because the abstraction leaks badly — network calls fail in ways local calls don't (timeouts, partial failures, retries that may or may not be idempotent). Modern RPC frameworks like gRPC (which uses Protobuf) and Apache Thrift acknowledge the network's existence: they expose futures/streams, retries, deadlines, and don't pretend latency is zero.

Compatibility in service calls: servers usually deploy before clients, so backward compatibility on requests (new server handles old client's request) and forward compatibility on responses (old client handles new server's response) are what you need. The same field-tagging tricks from Protobuf/Avro carry over.

Flow 3: Through Message Queues (Async Messaging)

The third pattern: producer writes a message to a broker (Kafka, RabbitMQ, etc.); one or more consumers read it later. This is decoupled — the producer doesn't know or care who's consuming, and consumers don't have to be running when the message is sent.

Compatibility matters even more here than in RPC because:

  • Multiple consumers may be on different versions reading the same topic.
  • Messages may sit in the queue for hours or days before being read.
  • A consumer that crashes on a message it doesn't understand can poison the whole queue.

Schema registries (Avro + Confluent's tooling, or Protobuf with similar setups) are heavily used here precisely to keep producers and consumers in sync as schemas evolve.

A useful framing the book offers: messaging is essentially a one-way RPC with a buffer in the middle. The compatibility concerns are the same; the failure modes are just decoupled in time.

Key Takeaways

Encoding is the boundary where evolution happens. Picking a format isn't just about size or speed — it's about whether you can change the format later without coordinating every service deploy.

Avoid language-specific serializers across process boundaries. They lock you in and they're a security risk.

Schema-based binary formats (Protobuf, Thrift, Avro) win on size, speed, and explicit evolution rules. Their main cost is the schema management itself.

Tag numbers (Protobuf/Thrift) and writer/reader schemas (Avro) are the mechanisms behind forward and backward compatibility. Adding fields is safe; removing or repurposing tag numbers is not.

Backward = new code reads old data. Forward = old code reads new data. Both matter during rolling deploys, with mobile clients, and across microservices.

Three dataflow modes, same compatibility story. Whether data moves through a database, an RPC call, or a message queue, the question is always: can both sides handle each other's schema drift?

#ddia#encoding#serialization#protobuf#avro#thrift#schema-evolution#rpc

Licensed under CC BY 4.0

Share: