
Data Models: Relational, Document, and Graph

A data model is the shape of thought a system makes cheap: tables for joins and constraints, documents for aggregate reads, graphs for relationships that keep moving.

First principle

A data model is an abstraction boundary: it hides storage details while forcing you to express the world in a shape that makes some questions natural and others awkward.

Same product fact, three shapes:

    RELATIONAL: users table + orders table + joins
    DOCUMENT:   one order document embeds shipping address + line items
    GRAPH:      user --bought--> item --similar_to--> item --viewed_by--> user

The right model is the one whose awkward operations are not on your hot path.

Relational: facts, keys, and joins

The relational model starts from relations: rows with named columns, connected by keys. Its superpower is separation of facts. A user name lives in one table; orders reference the user by key. The database can then join facts at query time and enforce constraints such as uniqueness and foreign-key relationships.
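The separation of facts described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a recommended schema: the table names, column names, and sample rows are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema for illustration; all names here are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE,
        name    TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL REFERENCES users(user_id),
        total_cents INTEGER NOT NULL
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'ada@example.com', 'Ada')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 2500), (11, 1, 400)])

# The user name is stored once; the join attaches it to each order at query time.
rows = conn.execute("""
    SELECT o.order_id, u.name, o.total_cents
    FROM orders AS o
    JOIN users AS u ON u.user_id = o.user_id
    ORDER BY o.order_id
""").fetchall()

# The database, not the application, rejects an order that points at a missing user.
try:
    conn.execute("INSERT INTO orders VALUES (12, 999, 100)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Renaming the user means updating one row in `users`; every order picks up the new name through the join.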

This is powerful when relationships are many-to-many, when the questions change, and when correctness matters. Analytics teams love the relational model because they can ask new questions without rewriting the application. Product teams love it when invariants matter: one user id, one account balance, one canonical model version.

The cost is that joins are work. On one machine, joins are an optimizer problem. Across partitions, they become data movement. A relational schema can be clean and still be slow if the hot query needs to gather rows scattered across machines.
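To make "joins become data movement" concrete, here is a toy count of how many order rows sit on a different partition from the user they reference. It is a sketch under stated assumptions: four partitions, modulo placement standing in for a real hash function, and synthetic data. It does not model how any real query planner executes a distributed join.

```python
NUM_PARTITIONS = 4  # assumption: a small fixed cluster

def partition_of(key: int) -> int:
    # Toy placement rule; modulo stands in for a real hash function.
    return key % NUM_PARTITIONS

# Synthetic orders spread across seven hypothetical users.
orders = [{"order_id": oid, "user_id": oid % 7} for oid in range(100)]

def rows_moved(order_rows, order_partition_key: str) -> int:
    """Count order rows that are not already on the partition holding their user."""
    moved = 0
    for order in order_rows:
        order_home = partition_of(order[order_partition_key])
        user_home = partition_of(order["user_id"])
        if order_home != user_home:
            moved += 1
    return moved

# Orders placed by their own id: many rows must cross the network to meet their user.
moved_by_order_id = rows_moved(orders, "order_id")

# Orders placed by user_id: each order is already co-located with its user.
moved_by_user_id = rows_moved(orders, "user_id")
```

The schema is identical in both cases; only the partitioning key changes, and with it the amount of data the join has to move.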

Document: locality by aggregate

A document model stores a nested aggregate together: one user profile, one product catalog item, one order with line items. It is a good fit when the application usually reads and writes the aggregate as a whole. The document becomes the unit of locality.

The trade is clear: document databases reduce impedance mismatch with application objects and can avoid joins for self-contained entities. But they get awkward when many-to-many relationships dominate. If every document embeds a copy of a shared fact, updates fan out. If documents reference each other, the application has reinvented joins without the database optimizer.
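Both awkward cases can be sketched with plain Python dictionaries standing in for stored documents. The field names (`_id`, `items`, `sku`) and the helper functions are hypothetical, and no particular document database API is implied.

```python
# Hypothetical order documents: each line item embeds a copy of the product name.
orders = [
    {"_id": "o1", "items": [{"sku": "A1", "name": "Kettle", "qty": 1}]},
    {"_id": "o2", "items": [{"sku": "A1", "name": "Kettle", "qty": 2},
                            {"sku": "B2", "name": "Mug", "qty": 4}]},
]

def rename_product(docs, sku, new_name):
    """Fan-out update: every document that holds a copy must be rewritten."""
    touched = 0
    for doc in docs:
        hit = False
        for item in doc["items"]:
            if item["sku"] == sku:
                item["name"] = new_name
                hit = True
        touched += hit  # True counts as 1
    return touched

docs_rewritten = rename_product(orders, "A1", "Electric Kettle")

# Alternative: store only a reference and resolve it in application code.
products = {"A1": {"name": "Electric Kettle"}, "B2": {"name": "Mug"}}
order_ref = {"_id": "o3", "items": [{"sku": "B2", "qty": 1}]}

def resolve(order, product_lookup):
    """An application-side join: one lookup per referenced product."""
    return [{**item, "name": product_lookup[item["sku"]]["name"]}
            for item in order["items"]]

resolved_items = resolve(order_ref, products)
```

With embedded copies, one rename touches every order that contains the product. With references, `resolve` is a hand-written join that no optimizer can plan or reorder.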

For ML systems, documents often appear at ingestion boundaries: raw events, prompts, annotations, model responses, trace spans. They preserve messy source shape. Later, pipelines frequently flatten them into relational or columnar forms for training and analysis.
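The flattening step might look like the following sketch, which turns one nested trace document into one flat row per span. The document shape and field names (`trace_id`, `model`, `spans`, `latency_ms`) are assumptions for illustration, not a real tracing format.

```python
def flatten_trace(doc):
    """Turn one nested trace document into flat rows, one per span."""
    rows = []
    for span in doc.get("spans", []):
        rows.append({
            "trace_id": doc["trace_id"],            # repeated on every row
            "model": doc.get("model"),
            "span_name": span["name"],
            "latency_ms": span.get("latency_ms"),   # missing fields become None
        })
    return rows

# A messy source document: the second span has no latency recorded.
raw = {
    "trace_id": "t-1",
    "model": "m-small",
    "spans": [{"name": "retrieve", "latency_ms": 12}, {"name": "generate"}],
}
flat_rows = flatten_trace(raw)
```

The rows repeat parent fields and fix a column set, which is what makes them easy to load into a relational or columnar store.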

Graph: relationships as first-class data

Graph models make edges first-class. They are strong when the question is not just "find this entity" but "walk relationships whose shape I cannot predict in advance." Social graphs, dependency graphs, lineage graphs, fraud rings, and knowledge graphs live here.

A graph query is natural when relationships are deep, recursive, and selective: friends of friends, packages depending on a vulnerable library, documents connected by citations, users sharing devices and payment instruments. Representing the same thing relationally is possible, but recursive joins become clumsy. Representing it as documents often duplicates too much.
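The "packages depending on a vulnerable library" question is a traversal of unknown depth. A minimal sketch using breadth-first search over an in-memory adjacency map follows; the package names and edges are invented, and a graph database would express this in its own query language rather than in application code.

```python
from collections import deque

# Hypothetical edges: package -> packages that depend on it directly.
dependents = {
    "libvuln": ["http-client", "img-tools"],
    "http-client": ["api-server"],
    "img-tools": ["api-server", "thumbnailer"],
    "api-server": [],
    "thumbnailer": [],
}

def affected_by(start, edges):
    """Return every package that depends on `start`, directly or transitively."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:       # guards against cycles and repeated paths
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

affected = affected_by("libvuln", dependents)
```

The traversal does not need to know the depth in advance, which is the property that a fixed number of relational self-joins lacks.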

The cost is operational and mental. Graph systems can be harder to scale horizontally because traversals do not respect neat shard boundaries. Many production systems therefore keep the system of record relational and build graph-shaped derived indexes for specific traversals.
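One way to picture a graph-shaped derived index over a relational system of record is sketched below. The `device_logins` table, its columns, and the `build_shared_device_index` helper are hypothetical; a production system would also need a strategy for keeping the index fresh, which this example ignores.

```python
import sqlite3
from collections import defaultdict
from itertools import combinations

# System of record: plain relational rows (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_logins (user_id TEXT, device_id TEXT)")
conn.executemany(
    "INSERT INTO device_logins VALUES (?, ?)",
    [("u1", "d1"), ("u2", "d1"), ("u2", "d2"), ("u3", "d2"), ("u4", "d3")],
)

def build_shared_device_index(connection):
    """Derive a user-to-user adjacency map: an edge means 'shared a device'."""
    by_device = defaultdict(set)
    query = "SELECT user_id, device_id FROM device_logins"
    for user_id, device_id in connection.execute(query):
        by_device[device_id].add(user_id)

    neighbors = defaultdict(set)
    for users in by_device.values():
        for a, b in combinations(sorted(users), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)  # plain dict so lookups do not create empty entries

# The index can be thrown away and rebuilt from the table at any time.
index = build_shared_device_index(conn)
```

Because the index is derived, losing it costs a rebuild rather than data; the table remains the source of truth.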

Trade-offs

| Choice | Buys | Costs |
| --- | --- | --- |
| Relational | Constraints, joins, ad-hoc queries, normalized truth | Cross-shard joins and schema changes require care |
| Document | Local aggregate reads, flexible schema, easy source ingestion | Many-to-many relationships and duplicated facts get awkward |
| Graph | Deep relationship traversal and evolving relationships | Partitioning traversals and operating graph indexes is harder |
| Vector / embedding index | Semantic similarity over unstructured data | Approximate, derived, hard to reason about correctness as truth |

What you can now decide

For each data model, you should be able to name the contract it offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

If the data model does not match the access pattern, complexity leaks upward. The app starts manually joining documents, manually enforcing constraints, or manually walking relationships the store cannot express.

Design prompts

  1. Model a RAG corpus. Which parts are documents, relations, graphs, and vectors?
  2. When would you keep a graph as a derived view instead of the system of record?
  3. Give one normalized and one denormalized model for a recommendation event.