all_lessons / data_intensive_systems / 06 · LSM and columnar lesson 6 / 16 · ~12 min

Storage Engines II: LSM Trees and Columnar Storage

LSM trees make writes sequential and push cleanup into compaction; column stores make analytical reads cheap by arranging bytes by question, not by object.

First principle

When data volume grows, physical layout becomes the algorithm. Put bytes in the order the workload will consume them.

LSM write path: write -> WAL + memtable -> immutable SSTable -> compaction Columnar scan: read only columns needed skip row fields that do not answer the query compress repeated values together

LSM tree: append first, merge later

An LSM tree turns random writes into sequential writes. New writes enter an in-memory sorted structure called a memtable and are appended to a WAL for durability. When the memtable fills, it flushes to disk as an immutable sorted file: an SSTable. Background compaction merges SSTables, discards overwritten values, and organizes levels.

This is beautiful for write-heavy workloads: ingest events, time-series data, logs, high-volume key-value writes. The write path avoids in-place page updates. But the cost is moved elsewhere. Reads may need to check multiple files. Bloom filters reduce unnecessary file probes, but cannot eliminate read amplification. Compaction consumes disk bandwidth and CPU, sometimes exactly when the system is already busy.

LSM thinking is also useful outside databases. Any system that appends immutable segments and later compacts them is making the same bet: make writes easy, clean up later.

Columnar storage: analytics reads columns, not rows

OLTP reads often fetch a small number of complete records. Analytics reads often scan millions of rows but only a few columns. Row storage is wasteful for that workload: reading a user's full record just to compute average age reads bytes that do not answer the query.

This is the OLTP/OLAP split in physical form. OLTP systems optimize many small, user-facing transactions: point lookups, short updates, index probes, and low tail latency. OLAP systems optimize fewer but much larger analytical scans: aggregates, joins, column projections, and throughput over huge snapshots. The same logical table can need two physical homes because the questions consume bytes differently.

Columnar storage groups values by column. That enables column pruning, compression, vectorized execution, and predicate pushdown. If a query needs country and revenue, it can skip bio, avatar_url, and debug_payload. Repeated values compress well because similar data sits together.

The cost is writes and point updates. Reconstructing one row means reading from many column chunks. Updating one row in place is awkward, so lakehouse formats use immutable files plus metadata, compaction, and delete vectors. Again: append and compact returns.

Choosing by workload

Use the workload vocabulary from lesson 01. Is the bottleneck random writes, point reads, range scans, or large analytical scans? Are queries known in advance? Is freshness seconds or hours? Is storage immutable? Are updates frequent?

Feature stores often use both worlds: online features in low-latency key-value storage, offline training features in columnar Parquet. Search systems use logs for ingestion, LSM-like indexes for update-heavy terms, and compressed segments for query-time scans. The architecture is a set of physical-layout choices, not a brand name.

Trade-offs

ChoiceBuysCosts
LSM treeHigh write throughput, sequential I/O, good compression after compactionRead amplification, compaction debt, temporary space amplification
B-treeStable reads and ranges, simple point lookup pathRandom write/page update cost
Columnar filesFast scans, compression, cheap analyticsSlow point updates and row reconstruction
Row storageFast point reads and writes of whole recordsAnalytical scans read many unnecessary bytes
What you can now decide

You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

If layout ignores workload, scale problems look mysterious. In reality the system is just reading the wrong bytes, rewriting the wrong pages, or compacting at the wrong time.

Design prompts

  1. Why does an LSM tree need compaction? What happens if compaction falls behind?
  2. Why is Parquet good for training-data scans but not ideal for online feature serving?
  3. Design online and offline storage for a feature used by a fraud model.