Storage & file formats
Before you transform data you have to store it, and the format you choose silently sets the cost of every job that touches it afterward. JSONL is where data is born; Parquet is where it should live; Arrow is how it travels.
Row vs columnar — the one distinction that matters
A dataset is a table of records. There are two ways to lay it on disk, and they have opposite performance profiles.
Row-major (JSONL): {prompt, response, lang, score}{prompt, response, lang, score}...
▶ whole record contiguous. Great to append one row, terrible to read one column.
Columnar (Parquet): [prompt prompt prompt ...][response response ...][lang ...][score ...]
▶ each column contiguous. Read one column without touching the rest. Compresses hard.
Post-training reads are almost always column- and row-selective: "give me the prompt and response columns for rows where lang='en' and score>0.8." Row storage must read every byte of every record to answer that. Columnar storage reads only the columns you asked for, and — because Parquet stores per-chunk min/max statistics — skips entire blocks of rows whose score can't match. That is column projection and predicate pushdown, and together they are why the same query costs 50× less on Parquet.
The three formats you'll actually use
| Format | Layout | Use it for | Cost |
|---|---|---|---|
| JSONL | Row, text | Ingest / interchange. Human-readable, append-only, every tool reads it. The bronze landing format. | No schema, no compression, no pushdown. Full scan every time. |
| Parquet | Columnar, compressed | The silver/gold store. Predicate + column pushdown, splittable for parallel reads, typed schema. | Not human-readable; rewrite to edit; small-file problem if mis-partitioned. |
| Arrow | Columnar, in-memory | The wire/RAM format. Zero-copy hand-off between Spark/Ray/Daft/pandas; the IPC layer. | In-memory representation, not a long-term storage format. |
Partitioning & sharding — physical layout
One 2 TB Parquet file is unusable: you can't read it in parallel and you can't skip parts of it. Two physical-layout knobs fix that:
- Partitioning — split files into directories by a column, e.g.
/lang=en/source=annotation/part-*.parquet. A query filtered onlang='en'reads only that directory; the rest is skipped at the filesystem level (partition pruning). Partition on columns you filter on, with low cardinality. - Sharding / file size — within a partition, aim for files of ~128–512 MB. Too large and you lose parallelism (lesson 05's partitions map to files); too small and you drown in metadata and open-file overhead — the infamous small-file problem.
user_id) creates millions of tiny files — each read needs a separate open, and the metadata alone can dwarf the data. Partition on a handful of low-cardinality columns you actually filter by; let file-size targets handle the rest.
Interactive · what a query actually reads
Pick a format, choose how many columns your job needs and how selective its filter is, and watch how many bytes leave disk. The dataset is fixed at 1 TB with 10 equal-width columns.