
Encoding, Schemas, and Evolution

Data outlives code. Encoding is how objects become bytes; schema evolution is how old and new code keep understanding those bytes during change.

First principle

The real compatibility problem is not one writer and one reader. It is many versions of writers and readers, deployed gradually, reading old data forever.

version N writer   ----+                                          +--> version N reader
                       +--> shared bytes on disk / log / wire ----+
version N+1 writer ----+                                          +--> version N+1 reader

Backward compatibility: new code reads old data.
Forward compatibility: old code tolerates new data.

Encoding is an architectural decision

In memory, data is objects, structs, dictionaries, tensors, or rows. On disk and over the network, it is bytes. Encoding is the conversion between those worlds. It affects latency, storage cost, debuggability, cross-language interoperability, and, most importantly, deployment safety.

Language-native serialization is convenient but dangerous as a long-term contract. It often ties bytes to one runtime and one class definition. Text formats such as JSON are human-readable and ubiquitous, but loose typing and missing schemas push validation into application code. Binary schema formats such as Protocol Buffers, Thrift, and Avro make the contract explicit and compact, but require schema discipline.
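The trade-off between a self-describing text format and a schema-dependent binary format can be sketched with the Python standard library alone. This is a minimal illustration, not Protocol Buffers, Thrift, or Avro: `struct` stands in for a binary format whose layout lives outside the bytes, and the record and its field names are hypothetical.

```python
import json
import struct

# Hypothetical record, used only for illustration.
record = {"user_id": 12345, "score": 0.75, "active": True}

# Self-describing text encoding: field names travel with every record.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Schema-dependent binary encoding: the layout is agreed on out of band.
# "<Id?" = little-endian, unsigned 32-bit int, 64-bit float, bool.
LAYOUT = "<Id?"
binary_bytes = struct.pack(
    LAYOUT, record["user_id"], record["score"], record["active"]
)


def decode_binary(data: bytes) -> dict:
    """Decoding only works if the reader knows the same layout."""
    user_id, score, active = struct.unpack(LAYOUT, data)
    return {"user_id": user_id, "score": score, "active": active}


print(len(json_bytes), len(binary_bytes))
```

The binary form is smaller because it carries no field names, which is also why a reader without the matching layout cannot interpret it. That shared layout is the schema discipline the paragraph above refers to.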

Choose encoding by boundary. For internal RPC between services, compact schema-based binary formats may win. For logs that humans inspect and many teams consume, self-describing JSON may be worth the cost. For analytical storage, columnar formats such as Parquet encode by column because scan efficiency matters more than row object fidelity.
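To make "encode by column" concrete, here is a small sketch of pivoting row records into per-field lists. It only shows the layout idea; real columnar formats such as Parquet add typing, compression, and metadata that are not modeled here. The rows and field names are hypothetical, and the helper assumes every row has the same fields.

```python
# Hypothetical row-oriented records.
rows = [
    {"user_id": 1, "country": "DE", "spend": 10.0},
    {"user_id": 2, "country": "US", "spend": 25.5},
    {"user_id": 3, "country": "DE", "spend": 4.5},
]


def to_columns(rows: list) -> dict:
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads only that field's values;
# user_id and country are never touched.
total_spend = sum(columns["spend"])
```

The cost shows up in the other direction: rebuilding a single row means reading one value from every column, which is why this layout suits scans better than RPC payloads or row updates.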

Schema evolution is deployment evolution

Data systems rarely deploy everything at once. A rolling deploy means old and new service versions coexist. A mobile client may be months old. A data lake stores records written years ago. Therefore, schemas must stay compatible in two directions.

Backward compatibility: new code can read old data. This is usually achieved by giving new fields defaults and keeping old fields readable. Forward compatibility: old code can tolerate data written by newer code, usually by ignoring unknown fields and avoiding changes that reinterpret existing fields.
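Both directions can be sketched with two readers over JSON records. The schemas, field names, and default value below are hypothetical; the point is only that the new reader supplies defaults for missing fields and the old reader ignores fields it does not know.

```python
import json

# Hypothetical v1 schema: the only fields old code knows about.
SCHEMA_V1_FIELDS = {"user_id", "email"}

# Hypothetical v2 schema: field name -> default used when the field is absent.
SCHEMA_V2_DEFAULTS = {"user_id": None, "email": None, "locale": "en"}


def read_v1(data: bytes) -> dict:
    """Old reader: keeps known fields, ignores unknown ones (forward compatible)."""
    raw = json.loads(data)
    return {name: raw.get(name) for name in SCHEMA_V1_FIELDS}


def read_v2(data: bytes) -> dict:
    """New reader: fills defaults for fields old writers never set (backward compatible)."""
    raw = json.loads(data)
    return {name: raw.get(name, default) for name, default in SCHEMA_V2_DEFAULTS.items()}


old_bytes = json.dumps({"user_id": 7, "email": "a@example.com"}).encode("utf-8")
new_bytes = json.dumps(
    {"user_id": 7, "email": "a@example.com", "locale": "fr"}
).encode("utf-8")

print(read_v2(old_bytes))  # new code reading old data
print(read_v1(new_bytes))  # old code reading new data
```

One limit of this sketch: an old reader that decodes a record and writes it back would drop `locale`. Code that rewrites records needs to carry unknown fields through rather than discard them.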

The safe pattern is expand-contract. First add optional fields and write both old and new forms. Then update all readers. Then stop writing the old field. Later, after retention windows and backfills, remove it. The slow part is not code. It is waiting for every reader and every stored record that still assumes the old world.
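The expand-contract steps can be sketched as a writer that changes behavior by phase and a reader that prefers the new field but still accepts the old one. The field names `zip` and `postal_code` and the `Phase` enum are hypothetical, chosen only for illustration.

```python
from enum import Enum


class Phase(Enum):
    EXPAND = 1    # write both the old and the new field
    CONTRACT = 2  # all readers migrated: write only the new field


def write_address(postal_code: str, phase: Phase) -> dict:
    """Writer whose output depends on the migration phase."""
    record = {"postal_code": postal_code}
    if phase is Phase.EXPAND:
        # Old field kept so readers that have not migrated keep working.
        record["zip"] = postal_code
    return record


def read_postal_code(record: dict) -> str:
    """Migrated reader: prefer the new field, fall back to the old one."""
    if "postal_code" in record:
        return record["postal_code"]
    # Records written before the migration only carry the old field.
    return record["zip"]


legacy_record = {"zip": "10115"}
expand_record = write_address("10115", Phase.EXPAND)
contract_record = write_address("10115", Phase.CONTRACT)
```

The fallback branch in `read_postal_code` is the part that lives longest. It can only be deleted once no stored record lacks `postal_code`, which depends on retention windows and backfills rather than on any deploy.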

Schema contracts in ML systems

ML systems turn schema mistakes into quiet model bugs. A renamed feature can become all nulls. A changed label definition can make an offline metric incomparable. A tokenization version mismatch can make two datasets look compatible while representing different sequences.
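The "renamed feature becomes all nulls" failure is easy to reproduce. In this sketch the feature names and records are hypothetical: an upstream producer renamed `session_length_s` to `session_seconds`, and a lenient extractor keeps running without raising any error.

```python
# Hypothetical feature list the model was trained on.
FEATURES = ["session_length_s", "country"]


def extract_features(record: dict, feature_names: list) -> list:
    """Lenient extraction: a missing feature silently becomes None."""
    return [record.get(name) for name in feature_names]


def null_rate(rows: list, index: int) -> float:
    """Fraction of rows whose feature at `index` is None."""
    return sum(1 for row in rows if row[index] is None) / len(rows)


# Upstream now emits the renamed field.
upstream_records = [
    {"session_seconds": 12.5, "country": "DE"},
    {"session_seconds": 3.0, "country": "US"},
]

rows = [extract_features(r, FEATURES) for r in upstream_records]
renamed_null_rate = null_rate(rows, 0)
country_null_rate = null_rate(rows, 1)
```

Nothing crashes here, which is the problem. A null-rate check per feature at the ingestion boundary is one inexpensive way to turn this quiet bug into a loud one.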

Treat feature definitions, dataset schemas, prompt/response formats, and model input signatures as data contracts. Version them. Validate them at ingestion boundaries. Record the schema version in the dataset snapshot and model registry. If serving code and training code disagree on schema, you have train/serve skew even if both programs run.
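One way to make these contracts checkable is to fingerprint the schema and validate records against it at the boundary. The sketch below is an assumption-laden illustration, not a particular registry's API: schemas are plain dicts mapping field names to Python type names, and all names are hypothetical.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Stable hash of a schema; key order does not affect the result."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif type(record[name]).__name__ != expected_type:
            actual = type(record[name]).__name__
            problems.append(f"wrong type for {name}: {actual}")
    return problems


# Hypothetical schemas: serving code was updated after a rename, training was not.
TRAINING_SCHEMA = {"session_length_s": "float", "country": "str"}
SERVING_SCHEMA = {"session_seconds": "float", "country": "str"}

# Comparing fingerprints detects the disagreement before any prediction is made.
schemas_match = schema_fingerprint(TRAINING_SCHEMA) == schema_fingerprint(SERVING_SCHEMA)

problems = validate({"session_seconds": 12.5, "country": "DE"}, TRAINING_SCHEMA)
```

Storing the fingerprint next to the dataset snapshot and the model entry lets a serving process compare a single string at startup instead of discovering the mismatch through degraded predictions.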

Trade-offs

| Choice | Buys | Costs |
|---|---|---|
| JSON / CSV | Readable, ubiquitous, easy to inspect | Large, weak typing, compatibility discipline lives in code |
| Protocol Buffers / Thrift | Compact, explicit schema, strong RPC contracts | Schema registry/process required; not naturally columnar |
| Avro | Good for data files/logs with schema evolution | Less pleasant for hand inspection; schema management still needed |
| Parquet / Arrow | Columnar scans, compression, analytics efficiency | Not ideal as an RPC or mutable row format |

What you can now decide

For any encoding or schema choice, you should be able to name the contract it offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

Without compatibility rules, every deploy becomes a distributed transaction over all producers, consumers, stored data, and clients. That transaction will fail in real life.

Design prompts

  1. Design a safe migration from user_age to birth_year in a feature pipeline.
  2. When would JSON be the right choice despite being slower and larger?
  3. What metadata should a model registry store to detect schema mismatch?