Encoding, Schemas, and Evolution

Data outlives code. Encoding is how objects become bytes; schema evolution is how old and new code keep understanding those bytes during change.

First principle

The real compatibility problem is not one writer and one reader. It is many versions of writers and readers, deployed gradually, reading old data forever.

version N writer ----+ +--> shared bytes on disk / log / wire --> version N reader version N+1 writer--+ --> version N+1 reader Backward compatibility: new code reads old data. Forward compatibility: old code tolerates new data.

Encoding is an architectural decision

In memory, data is objects, structs, dictionaries, tensors, or rows. On disk and over the network, it is bytes. Encoding is the conversion between those worlds. It affects latency, storage cost, debuggability, cross-language interoperability, and most importantly: deployment safety.

Language-native serialization is convenient but dangerous as a long-term contract. It often ties bytes to one runtime and one class definition. Text formats such as JSON are human-readable and ubiquitous, but loose typing and missing schemas push validation into application code. Binary schema formats such as Protocol Buffers, Thrift, and Avro make the contract explicit and compact, but require schema discipline.

Choose encoding by boundary. For internal RPC between services, compact schema-based binary formats may win. For logs that humans inspect and many teams consume, self-describing JSON may be worth the cost. For analytical storage, columnar formats such as Parquet encode by column because scan efficiency matters more than row object fidelity.

Schema evolution is deployment evolution

Data systems rarely deploy everything at once. A rolling deploy means old and new service versions coexist. A mobile client may be months old. A data lake stores records written years ago. Therefore schemas must evolve in two directions.

Backward compatibility: new code can read old data. This is usually achieved by giving new fields defaults and keeping old fields readable. Forward compatibility: old code can tolerate data written by newer code, usually by ignoring unknown fields and avoiding changes that reinterpret existing fields.

The safe pattern is expand-contract. First add optional fields and write both old and new forms. Then update all readers. Then stop writing the old field. Later, after retention windows and backfills, remove it. The slow part is not code. It is waiting for every reader and every stored record that still assumes the old world.

Schema contracts in ML systems

ML systems turn schema mistakes into quiet model bugs. A renamed feature can become all nulls. A changed label definition can make an offline metric incomparable. A tokenization version mismatch can make two datasets look compatible while representing different sequences.

Treat feature definitions, dataset schemas, prompt/response formats, and model input signatures as data contracts. Version them. Validate them at ingestion boundaries. Record the schema version in the dataset snapshot and model registry. If serving code and training code disagree on schema, you have train/serve skew even if both programs run.

Trade-offs

Choice	Buys	Costs
JSON / CSV	Readable, ubiquitous, easy to inspect	Large, weak typing, compatibility discipline lives in code
Protocol Buffers / Thrift	Compact, explicit schema, strong RPC contracts	Schema registry/process required; not naturally columnar
Avro	Good for data files/logs with schema evolution	Less pleasant for hand inspection; schema management still needed
Parquet / Arrow	Columnar scans, compression, analytics efficiency	Not ideal as an RPC or mutable row format

What you can now decide

You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

Without compatibility rules, every deploy becomes a distributed transaction over all producers, consumers, stored data, and clients. That transaction will fail in real life.

Design prompts

Design a safe migration from user_age to birth_year in a feature pipeline.
When would JSON be the right choice despite being slower and larger?
What metadata should a model registry store to detect schema mismatch?