Encoding, Schemas, and Evolution
Data outlives code. Encoding is how objects become bytes; schema evolution is how old and new code keep understanding those bytes during change.
The real compatibility problem is not one writer and one reader. It is many versions of writers and readers, deployed gradually, reading old data forever.
Encoding is an architectural decision
In memory, data is objects, structs, dictionaries, tensors, or rows. On disk and over the network, it is bytes. Encoding is the conversion between those worlds. It affects latency, storage cost, debuggability, cross-language interoperability, and, most importantly, deployment safety.
Language-native serialization is convenient but dangerous as a long-term contract. It often ties bytes to one runtime and one class definition. Text formats such as JSON are human-readable and ubiquitous, but loose typing and missing schemas push validation into application code. Binary schema formats such as Protocol Buffers, Thrift, and Avro make the contract explicit and compact, but require schema discipline.
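To make the text-versus-binary trade-off concrete, here is a minimal sketch using only the Python standard library. The record and its field names are hypothetical, and the fixed `struct` layout is a stand-in for a real schema format such as Protocol Buffers or Avro, not an implementation of either.

```python
import json
import struct

# Hypothetical record, used only for illustration.
record = {"user_id": 1234, "score": 0.75, "active": True}

# Self-describing text encoding: field names travel with every record.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Schema-based binary encoding: the layout lives outside the bytes.
# "<Id?" means little-endian, unsigned 32-bit int, 64-bit float, bool.
LAYOUT = struct.Struct("<Id?")
binary_bytes = LAYOUT.pack(record["user_id"], record["score"], record["active"])


def decode_binary(data: bytes) -> dict:
    """Decode bytes using the out-of-band layout; field names come from code."""
    user_id, score, active = LAYOUT.unpack(data)
    return {"user_id": user_id, "score": score, "active": active}


print(len(json_bytes), len(binary_bytes))
print(decode_binary(binary_bytes))
```

The binary form is smaller because it carries no field names, but that is exactly why it needs schema discipline: a reader with a different layout will decode the same bytes into different values without any error.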
Choose encoding by boundary. For internal RPC between services, compact schema-based binary formats may win. For logs that humans inspect and many teams consume, self-describing JSON may be worth the cost. For analytical storage, columnar formats such as Parquet encode by column because scan efficiency matters more than row object fidelity.
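The difference between row and columnar layout can be sketched in plain Python. This is only an illustration of the data arrangement with made-up rows; real columnar formats such as Parquet add per-column encoding, compression, and metadata on top of it.

```python
# Hypothetical rows, as an RPC payload or a JSON log would carry them.
rows = [
    {"user_id": 1, "country": "DE", "spend": 10.0},
    {"user_id": 2, "country": "FR", "spend": 25.5},
    {"user_id": 3, "country": "DE", "spend": 4.5},
]

# Columnar layout: one contiguous list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical scan over one field touches only that column.
total_spend = sum(columns["spend"])

# Rebuilding a single row means reading from every column,
# which is why this layout is a poor fit for row-at-a-time access.
second_row = {name: values[1] for name, values in columns.items()}

print(total_spend)
print(second_row)
```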
Schema evolution is deployment evolution
Data systems rarely deploy everything at once. A rolling deploy means old and new service versions coexist. A mobile client may be months old. A data lake stores records written years ago. Schemas must therefore stay compatible in two directions.
Backward compatibility: new code can read old data. This is usually achieved by giving new fields defaults and keeping old fields readable. Forward compatibility: old code can tolerate data written by newer code, usually by ignoring unknown fields and avoiding changes that reinterpret existing fields.
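Both directions can be shown with one tolerant reader. The sketch below assumes JSON records and a hypothetical reader schema (`READER_V2_DEFAULTS`) that maps each known field to its default; the field names are invented for illustration.

```python
import json

# Hypothetical v2 reader schema: known field -> default used when it is absent.
READER_V2_DEFAULTS = {"user_id": None, "email": None, "locale": "en"}


def read_v2(raw: bytes) -> dict:
    """Decode a record written by any writer version."""
    data = json.loads(raw)
    # Backward compatibility: fields an old writer never set get a default.
    # Forward compatibility: fields this reader does not know are ignored.
    return {
        name: data.get(name, default)
        for name, default in READER_V2_DEFAULTS.items()
    }


# Written by an older writer that predates "locale".
old_record = json.dumps({"user_id": 1, "email": "a@example.com"}).encode()

# Written by a newer writer that already emits "timezone".
newer_record = json.dumps(
    {"user_id": 2, "email": "b@example.com", "locale": "fr", "timezone": "UTC"}
).encode()

print(read_v2(old_record))
print(read_v2(newer_record))
```

Ignoring unknown fields is safe for a pure reader. If the same code decodes a record and writes it back, dropping the unknown field destroys data that a newer writer put there, so read-modify-write paths need to preserve what they do not understand.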
The safe pattern is expand-contract. First add optional fields and write both old and new forms. Then update all readers. Then stop writing the old field. Later, after retention windows and backfills, remove it. The slow part is not code. It is waiting for every reader and every stored record that still assumes the old world.
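The phases above can be sketched as a writer that changes behavior by phase and a reader that tolerates both forms. The rename from `name` to `full_name`, the `Phase` enum, and the function names are all hypothetical; a real migration would also gate the phase switch on deploy status and retention windows.

```python
from enum import Enum


class Phase(Enum):
    EXPAND = "expand"      # write both the old and the new field
    CONTRACT = "contract"  # all readers updated: write only the new field


def write_record(full_name: str, phase: Phase) -> dict:
    """Writer for a hypothetical rename of 'name' to 'full_name'."""
    record = {"full_name": full_name}
    if phase is Phase.EXPAND:
        # Keep the legacy field so readers that were not yet updated still work.
        record["name"] = full_name
    return record


def read_name(record: dict) -> str:
    """Updated reader: prefer the new field, fall back for old stored data."""
    if "full_name" in record:
        return record["full_name"]
    return record["name"]


legacy_stored = {"name": "Ada Lovelace"}  # written before the migration began
during_expand = write_record("Ada Lovelace", Phase.EXPAND)
after_contract = write_record("Ada Lovelace", Phase.CONTRACT)

for rec in (legacy_stored, during_expand, after_contract):
    print(read_name(rec))
```

The fallback branch in `read_name` is the part that can only be deleted after every stored record in the old form has been backfilled or has aged out.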
Schema contracts in ML systems
ML systems turn schema mistakes into quiet model bugs. A renamed feature can become all nulls. A changed label definition can make an offline metric incomparable. A tokenization version mismatch can make two datasets look compatible while representing different sequences.
Treat feature definitions, dataset schemas, prompt/response formats, and model input signatures as data contracts. Version them. Validate them at ingestion boundaries. Record the schema version in the dataset snapshot and model registry. If serving code and training code disagree on schema, you have train/serve skew even if both programs run.
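One way to make such a contract checkable is to fingerprint the schema and validate rows at the ingestion boundary. The sketch below is an assumption-laden illustration: the schema shape (`version` plus `fields` mapping names to Python type names), the feature names, and both helper functions are hypothetical, not the API of any particular registry or validation library.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Stable hash of a schema, suitable for storing next to a snapshot or model."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_row(row: dict, schema: dict) -> list[str]:
    """Return contract violations instead of letting them turn into nulls."""
    errors = []
    for name, expected_type in schema["fields"].items():
        if name not in row:
            errors.append(f"missing field: {name}")
        elif type(row[name]).__name__ != expected_type:
            errors.append(f"wrong type for {name}: expected {expected_type}")
    return errors


# Hypothetical schemas: serving renamed a feature without telling training.
training_schema = {"version": 3, "fields": {"age_bucket": "int", "country": "str"}}
serving_schema = {"version": 3, "fields": {"age_group": "int", "country": "str"}}

# Comparing fingerprints catches the skew even though the version number matches.
skew_detected = schema_fingerprint(training_schema) != schema_fingerprint(serving_schema)

# Row produced by the serving side, checked against the training contract.
errors = validate_row({"age_group": 2, "country": "DE"}, training_schema)

print(skew_detected)
print(errors)
```

Storing the fingerprint rather than only a hand-maintained version number means an unannounced change is still detected when someone forgets to bump the version.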
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| JSON / CSV | Readable, ubiquitous, easy to inspect | Large, weak typing, compatibility discipline lives in code |
| Protocol Buffers / Thrift | Compact, explicit schema, strong RPC contracts | Schema registry/process required; not naturally columnar |
| Avro | Good for data files/logs with schema evolution | Less pleasant for hand inspection; schema management still needed |
| Parquet / Arrow | Columnar scans, compression, analytics efficiency | Not ideal as an RPC or mutable row format |
For each choice, you should be able to name the contract it offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
Without compatibility rules, every deploy becomes a distributed transaction over all producers, consumers, stored data, and clients. That transaction will fail in real life.
Design prompts
- Design a safe migration from `user_age` to `birth_year` in a feature pipeline.
- When would JSON be the right choice despite being slower and larger?
- What metadata should a model registry store to detect schema mismatch?