Storage Engines I: Logs, Hash Indexes, and B-Trees
A database write becomes durable by entering a log; a read becomes fast by using an index. The first storage fork is whether you overwrite pages or append history.
Storage engines are machines for trading write cost, read cost, space, and recovery time. The log is the primitive that makes durable change cheap.
The append-only log
The simplest durable database is a file you append to. To set k = v, append a record. To recover after a crash, replay the file. Sequential append is the cheapest persistent write pattern, and it gives you an ordered history of changes.
The problem is reads. If the log contains every historical value of k, finding the latest value by scanning from the beginning is too slow. So we add an index: a separate structure mapping key to location in the log. A hash index gives fast point lookups, but only if the keys fit in memory and range queries do not matter.
Compaction keeps the log from growing forever: merge segments and keep only the newest value for each key. This already reveals the storage-engine pattern that will return in LSM trees: write sequentially now, clean up history later.
The write-ahead log
Even page-oriented databases rely on logs. If a B-tree page is overwritten in place and the machine loses power halfway through, the on-disk tree can be corrupted. The write-ahead log solves this by writing the intent to a sequential log before modifying pages. After a crash, replay the log to finish committed changes or roll back incomplete ones.
This is not just an implementation detail. The same ordered log becomes the replication stream, the source for change data capture, and the audit trail for recovery. Once you see "ordered durable changes," you will see the same abstraction inside databases, Kafka, event sourcing, model registries, and workflow engines.
B-trees: sorted pages for reads and ranges
A B-tree stores keys in sorted order across fixed-size pages. A lookup walks from root to leaf, reading a small number of pages even for huge tables. Because keys are sorted, range scans are efficient: find the starting key and walk forward.
The price is write amplification. Updating one record may dirty a whole page; inserts can split pages; maintaining secondary indexes means every write updates more structures. The WAL protects durability, but the engine still does random-ish page work. B-trees are excellent when you need low-latency reads, range queries, and mixed OLTP workloads.
For interview thinking: a B-tree is not "better" than an LSM tree. It pays more at write time to make reads and ranges straightforward. That is the trade.
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| Append-only log + hash index | Very fast writes and point reads | Memory index, no range scans, compaction required |
| B-tree primary index | Good point reads, excellent ranges, mature transaction support | Random page updates and write amplification |
| Many secondary indexes | More queries become cheap | Every write must update every relevant index |
| Write-ahead logging | Crash recovery and replication source | Extra write path and log management |
You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
Without an index, reads scan history. Without a log, recovery after partial writes becomes guesswork. Without compaction or page cleanup, storage grows with every old version forever.
Design prompts
- Why does adding an index speed reads but slow writes?
- Why is a WAL useful even if the database stores data in B-tree pages?
- Pick B-tree or append-only hash index for a session store. What assumptions did you make?