
Storage & file formats

Before you transform data you have to store it, and the format you choose silently sets the cost of every job that touches it afterward. JSONL is where data is born; Parquet is where it should live; Arrow is how it travels.

Where we are
Lesson 03 landed raw data in the bronze layer. This lesson decides how it's stored so that lesson 05's distributed transforms can read it cheaply. The wrong format here makes every later stage scan terabytes it never needed.

Row vs columnar — the one distinction that matters

A dataset is a table of records. There are two ways to lay it on disk, and they have opposite performance profiles.

Row-major (JSONL):   {prompt, response, lang, score}{prompt, response, lang, score}...
                     ▶ whole record contiguous. Great to append one row, terrible to read one column.

Columnar (Parquet):  [prompt prompt prompt ...][response response ...][lang ...][score ...]
                     ▶ each column contiguous. Read one column without touching the rest. Compresses hard.

Post-training reads are almost always column- and row-selective: "give me the prompt and response columns for rows where lang='en' and score>0.8." Row storage must read every byte of every record to answer that. Columnar storage reads only the columns the query touches and, because Parquet stores min/max statistics for each column chunk, skips entire row groups whose score range can't match. These two mechanisms are column projection and predicate pushdown. Together they explain how the same query can read about 50× fewer bytes on Parquet: 20 GB instead of 1 TB in this lesson's running example.
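To make the two mechanisms concrete, here is a minimal pure-Python sketch of a columnar store with per-block statistics. It is a toy model, not Parquet's actual implementation: the `RowGroup` class, `make_row_group`, and `query` are hypothetical names invented for illustration, and the four sample rows are made up.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """A block of rows stored column by column, plus min/max stats for score."""
    columns: dict      # column name -> list of values for this block
    score_min: float
    score_max: float

def make_row_group(rows):
    # Pivot row-major records into one contiguous list per column.
    columns = {name: [r[name] for r in rows] for name in rows[0]}
    return RowGroup(columns, min(columns["score"]), max(columns["score"]))

def query(groups, wanted, min_score):
    """Return the wanted columns for rows with lang == 'en' and score > min_score."""
    out, groups_read = [], 0
    for g in groups:
        # Predicate pushdown: if even the block's max score fails the filter,
        # no row inside can match, so skip the block without reading its data.
        if g.score_max <= min_score:
            continue
        groups_read += 1
        # Column projection: touch only the filter columns and the wanted ones.
        langs, scores = g.columns["lang"], g.columns["score"]
        for i in range(len(scores)):
            if langs[i] == "en" and scores[i] > min_score:
                out.append({c: g.columns[c][i] for c in wanted})
    return out, groups_read

rows = [
    {"prompt": "p0", "response": "r0", "lang": "en", "score": 0.2},
    {"prompt": "p1", "response": "r1", "lang": "de", "score": 0.4},
    {"prompt": "p2", "response": "r2", "lang": "en", "score": 0.9},
    {"prompt": "p3", "response": "r3", "lang": "fr", "score": 0.95},
]
groups = [make_row_group(rows[:2]), make_row_group(rows[2:])]
result, groups_read = query(groups, ["prompt", "response"], 0.8)
print(result, groups_read)
```

The first block's highest score is 0.4, so it is skipped on its statistics alone. Only the second block is opened, and only one of its rows passes both filters.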

The three formats you'll actually use

Format  | Layout               | Use it for | Cost
JSONL   | Row, text            | Ingest / interchange. Human-readable, append-only, every tool reads it. The bronze landing format. | No schema, no compression, no pushdown. Full scan every time.
Parquet | Columnar, compressed | The silver/gold store. Predicate + column pushdown, splittable for parallel reads, typed schema. | Not human-readable; rewrite to edit; small-file problem if mis-partitioned.
Arrow   | Columnar, in-memory  | The wire/RAM format. Zero-copy hand-off between Spark/Ray/Daft/pandas; the IPC layer. | In-memory representation, not a long-term storage format.

The rule of thumb
Land as JSONL, store as Parquet, move as Arrow. JSONL at the door because everything emits it; Parquet for everything that lives on disk and gets queried; Arrow whenever data crosses a process boundary, so no one pays to serialize/deserialize.

Partitioning & sharding — physical layout

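One 2 TB Parquet file is a poor layout: readers that parallelize per file get a single task, and nothing can be pruned before the file's own metadata is read. Two physical-layout knobs fix that: partitioning, which splits the dataset into directories by the value of a column you filter on so whole directories are skipped, and sharding, which splits each partition into multiple files sized for parallel reads (the 128–512 MB band recommended in the takeaway).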

Over-partitioning is a trap
Partitioning on a high-cardinality column (e.g. user_id) creates millions of tiny files — each read needs a separate open, and the metadata alone can dwarf the data. Partition on a handful of low-cardinality columns you actually filter by; let file-size targets handle the rest.
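The arithmetic behind that warning can be sketched in a few lines. The figures here are assumptions chosen for illustration, not numbers from the lesson: a 2 TB dataset split evenly, 20 distinct `lang` values, 5 million distinct `user_id` values, and a 256 MB target file size. The `layout` helper is hypothetical.

```python
import math

TB = 10**12
MB = 10**6

def layout(total_bytes, partitions, target_file_bytes):
    """Estimate file count and average file size for an evenly split dataset."""
    bytes_per_partition = total_bytes / partitions
    # Every partition needs at least one file, even if it is far below target.
    files_per_partition = max(1, math.ceil(bytes_per_partition / target_file_bytes))
    files = partitions * files_per_partition
    return {"files": files, "avg_file_bytes": total_bytes / files}

# Assumed cardinalities: 20 languages vs 5 million users.
by_lang = layout(2 * TB, partitions=20, target_file_bytes=256 * MB)
by_user = layout(2 * TB, partitions=5_000_000, target_file_bytes=256 * MB)

print(by_lang)  # a few thousand files, each close to the target size
print(by_user)  # one file per user, each a tiny fraction of the target
```

Under these assumptions, partitioning by `lang` yields 7,820 files of roughly 256 MB each. Partitioning by `user_id` yields 5,000,000 files of 400 KB each, which is the small-file problem in numbers.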

Interactive · what a query actually reads

Pick a format, choose how many columns your job needs and how selective its filter is, and watch how many bytes leave disk. The dataset is fixed at 1 TB with 10 equal-width columns.

Bytes scanned: JSONL vs Parquet
Column projection reads only the columns you select. Predicate pushdown uses per-row-group statistics to skip blocks that can't match your filter. JSONL can do neither — it always reads everything.
The widget reports four figures for each setting: bytes read, the ratio to a full scan, read time at 2 GB/s, and cost at $0.02/GB.
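Without the widget, the same numbers can be approximated with a small calculator. This is a sketch under the simplest model consistent with the text, and every modeling choice is an assumption: 1 TB is taken as 1000 GB, columns are equal width, and the filter's selectivity is treated as the fraction of row groups that survive pruning (which holds only when matching rows are clustered). The `scan` function is a hypothetical helper.

```python
TOTAL_GB = 1000       # the fixed 1 TB dataset, taking 1 TB = 1000 GB
NUM_COLUMNS = 10      # equal-width columns
READ_GB_PER_S = 2.0   # read throughput from the widget label
COST_PER_GB = 0.02    # scan price from the widget label

def scan(fmt, columns_needed, selectivity):
    """Estimate what a query reads.

    selectivity is the fraction of row groups left after predicate pushdown
    (assumed equal to the fraction of matching rows).
    """
    if fmt == "jsonl":
        # Row-major text: no projection, no pushdown, always a full scan.
        gb = float(TOTAL_GB)
    elif fmt == "parquet":
        # Projection keeps columns_needed of NUM_COLUMNS; pushdown keeps selectivity.
        gb = TOTAL_GB * columns_needed / NUM_COLUMNS * selectivity
    else:
        raise ValueError(f"unknown format: {fmt}")
    return {
        "gb_read": gb,
        "vs_full_scan": gb / TOTAL_GB,
        "seconds": gb / READ_GB_PER_S,
        "dollars": gb * COST_PER_GB,
    }

jsonl = scan("jsonl", columns_needed=2, selectivity=0.1)
parquet = scan("parquet", columns_needed=2, selectivity=0.1)
print(jsonl)
print(parquet)
```

With two of ten columns and a filter that keeps one block in ten, the model gives about 20 GB for Parquet against 1000 GB for JSONL: roughly 10 seconds and $0.40 versus 500 seconds and $20.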

Takeaway

What to carry to lesson 05
Store curated data as partitioned, compressed Parquet so reads are column- and row-selective; keep files in the 128–512 MB band so they map cleanly to the parallel partitions lesson 05's engines consume. The format isn't a detail — it's the difference between a transform that reads 20 GB and one that reads 1 TB to produce the same result.