Storage & file formats

Before you transform data you have to store it, and the format you choose silently sets the cost of every job that touches it afterward. JSONL is where data is born; Parquet is where it should live; Arrow is how it travels.

Where we are

Lesson 03 landed raw data in the bronze layer. This lesson decides how it's stored so that lesson 05's distributed transforms can read it cheaply. The wrong format here makes every later stage scan terabytes it never needed.

Row vs columnar — the one distinction that matters

A dataset is a table of records. There are two ways to lay it on disk, and they have opposite performance profiles.

Row-major (JSONL):   {prompt, response, lang, score}{prompt, response, lang, score}...
                     ▶ whole record contiguous. Great to append one row, terrible to read one column.

Columnar (Parquet):  [prompt prompt prompt ...][response response ...][lang ...][score ...]
                     ▶ each column contiguous. Read one column without touching the rest. Compresses hard.

Post-training reads are almost always column- and row-selective: "give me the prompt and response columns for rows where lang='en' and score>0.8." Row storage must read every byte of every record to answer that. Columnar storage reads only the columns you asked for, and — because Parquet stores per-chunk min/max statistics — skips entire blocks of rows whose score can't match. That is column projection and predicate pushdown, and together they are why the same query costs 50× less on Parquet.

The three formats you'll actually use

Format	Layout	Use it for	Cost
JSONL	Row, text	Ingest / interchange. Human-readable, append-only, every tool reads it. The bronze landing format.	No schema, no compression, no pushdown. Full scan every time.
Parquet	Columnar, compressed	The silver/gold store. Predicate + column pushdown, splittable for parallel reads, typed schema.	Not human-readable; rewrite to edit; small-file problem if mis-partitioned.
Arrow	Columnar, in-memory	The wire/RAM format. Zero-copy hand-off between Spark/Ray/Daft/pandas; the IPC layer.	In-memory representation, not a long-term storage format.

The rule of thumb

Land as JSONL, store as Parquet, move as Arrow. JSONL at the door because everything emits it; Parquet for everything that lives on disk and gets queried; Arrow whenever data crosses a process boundary, so no one pays to serialize/deserialize.

Partitioning & sharding — physical layout

One 2 TB Parquet file is unusable: you can't read it in parallel and you can't skip parts of it. Two physical-layout knobs fix that:

Partitioning — split files into directories by a column, e.g. /lang=en/source=annotation/part-*.parquet. A query filtered on lang='en' reads only that directory; the rest is skipped at the filesystem level (partition pruning). Partition on columns you filter on, with low cardinality.
Sharding / file size — within a partition, aim for files of ~128–512 MB. Too large and you lose parallelism (lesson 05's partitions map to files); too small and you drown in metadata and open-file overhead — the infamous small-file problem.

Over-partitioning is a trap

Partitioning on a high-cardinality column (e.g. user_id) creates millions of tiny files — each read needs a separate open, and the metadata alone can dwarf the data. Partition on a handful of low-cardinality columns you actually filter by; let file-size targets handle the rest.

Interactive · what a query actually reads

Pick a format, choose how many columns your job needs and how selective its filter is, and watch how many bytes leave disk. The dataset is fixed at 1 TB with 10 equal-width columns.

Takeaway

What to carry to lesson 05

Store curated data as partitioned, compressed Parquet so reads are column- and row-selective; keep files in the 128–512 MB band so they map cleanly to the parallel partitions lesson 05's engines consume. The format isn't a detail — it's the difference between a transform that reads 20 GB and one that reads 1 TB to produce the same result.