Derived Data Capstone: Caches, Indexes, Features, RAG, and Correctness
Modern applications compose specialized systems. The design problem is not choosing one perfect database; it is keeping derived state useful, explainable, and repairable.
Every cache, index, materialized view, embedding table, feature table, and model artifact is derived data unless it is the system of record. Treat it as rebuildable or prove why it is authoritative.
Unbundling the database
No single system is best at every access pattern. A relational database may own transactional truth. Redis may serve hot reads. Elasticsearch may serve text search. A vector index may serve semantic retrieval. A warehouse may serve analytics. A feature store may serve online model inputs. The systems downstream of the transactional store are not separate truths. They are derived views optimized for different questions.
The architecture succeeds when each derived view has a clear source, update path, lag metric, schema contract, and rebuild path. It fails when copies become mysterious: nobody knows why the search index differs from the database, which embedding model produced a vector, or whether a feature value was computed before or after the label.
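The five properties above can be written down as an explicit record per derived view. The following is a minimal sketch, not a standard API: `DerivedViewContract`, `contract_gaps`, `within_lag_budget`, and every field value in the example are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class DerivedViewContract:
    """Hypothetical record of what one derived view promises."""
    name: str
    source: str             # system of record the view derives from
    update_path: str        # e.g. "sync", "cdc", or "batch"
    lag_metric: str         # name of the metric that reports staleness
    max_lag_seconds: float  # freshness budget agreed with consumers
    schema_version: str     # version of the schema contract
    rebuild_path: str       # job or runbook that recomputes the view


def contract_gaps(contract: DerivedViewContract) -> List[str]:
    """Return the names of string fields that were left empty."""
    return [
        f.name
        for f in fields(contract)
        if isinstance(getattr(contract, f.name), str)
        and not getattr(contract, f.name).strip()
    ]


def within_lag_budget(contract: DerivedViewContract,
                      source_commit_ts: float,
                      view_applied_ts: float) -> bool:
    """True if the view trails the source by no more than the budget."""
    return (source_commit_ts - view_applied_ts) <= contract.max_lag_seconds


# Illustrative values only: a search index with no documented rebuild path.
search_index = DerivedViewContract(
    name="product_search",
    source="orders_db.products",
    update_path="cdc",
    lag_metric="search_index_lag_seconds",
    max_lag_seconds=30.0,
    schema_version="v3",
    rebuild_path="",
)
print(contract_gaps(search_index))  # ['rebuild_path']
```

A check like `contract_gaps` could run in CI so that a new derived view cannot ship without a named source and rebuild path.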
RAG as a data-intensive application
A retrieval-augmented generation (RAG) pipeline is a natural capstone. Documents arrive as messy source data. They are parsed, chunked, enriched, embedded, indexed, searched, reranked, and cited. Each stage produces derived data. The vector index is not truth; it is a lossy, approximate access path over document versions.
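One way to keep the vector index traceable to document versions is to stamp every chunk with its lineage at creation time. This sketch assumes fixed-size character chunking and a hypothetical `ChunkRecord` type; it does not compute embeddings, and the model name passed in is only a label.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChunkRecord:
    """Hypothetical chunk row carrying its own lineage."""
    chunk_id: str
    doc_id: str
    doc_version: int
    position: int
    text: str
    embedding_model: str  # which model will embed (or embedded) this chunk


def chunk_document(doc_id: str, doc_version: int, text: str,
                   embedding_model: str, size: int = 200) -> List[ChunkRecord]:
    """Split text into fixed-size chunks with deterministic, versioned ids."""
    records = []
    for position, start in enumerate(range(0, len(text), size)):
        # The id depends on document, version, position, and model, so
        # re-running the same input yields the same ids (safe to replay),
        # while a new version or model yields new ids.
        key = f"{doc_id}:{doc_version}:{position}:{embedding_model}"
        chunk_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
        records.append(ChunkRecord(
            chunk_id=chunk_id,
            doc_id=doc_id,
            doc_version=doc_version,
            position=position,
            text=text[start:start + size],
            embedding_model=embedding_model,
        ))
    return records
```

Because the ids are deterministic, re-ingesting an unchanged document overwrites the same rows instead of duplicating them, and a citation can name the exact `doc_version` and `embedding_model` behind a retrieved chunk.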
The design questions follow the whole track:
- Encoding: what is the document schema and chunk schema?
- Storage: where is the source of record?
- Partitioning: how are tenants and corpora isolated?
- Replication: where are indexes served?
- Streaming: are updates CDC-driven or batch rebuilt?
- Consistency: after document deletion, how quickly must retrieval stop returning it?
- Observability: can an answer cite the exact document version and embedding model?
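The consistency question about deletion has one inexpensive partial answer: check retrieved hits against the source of record before they reach the prompt. The sketch below assumes a hypothetical `Hit` type and a `live_versions` mapping from document id to its current version, with deleted documents absent from the mapping.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical retrieval result carrying lineage from the index."""
    doc_id: str
    doc_version: int
    score: float


def filter_live_hits(hits: List[Hit],
                     live_versions: Dict[str, int]) -> List[Hit]:
    """Keep only hits whose document still exists at the indexed version.

    A deleted document has no entry in live_versions, and a superseded
    chunk has a version that no longer matches, so both are dropped.
    """
    return [h for h in hits if live_versions.get(h.doc_id) == h.doc_version]


hits = [Hit("a", 2, 0.9), Hit("b", 1, 0.8), Hit("c", 1, 0.7)]
live_versions = {"a": 2, "c": 2}  # "b" was deleted, "c" was re-indexed
print(filter_live_hits(hits, live_versions))
# [Hit(doc_id='a', doc_version=2, score=0.9)]
```

This makes deleted content unreachable as soon as the source of record reflects the deletion, at the cost of one extra lookup per query. The stale vectors still have to be removed from the index asynchronously.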
Correctness, integrity, and privacy
Derived data creates correctness obligations. If a user deletes data, every derived copy must be deleted or made unreachable. If a feature is used for training, its event-time computation must avoid label leakage. If a cache is stale, the product must tolerate it. If a materialized view is wrong, the system must detect and rebuild it.
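The label-leakage obligation can be illustrated with a point-in-time lookup: a training row may only see the feature value that was known before the label's event time. This is a minimal sketch; `feature_as_of` is a hypothetical helper, and excluding values stamped at exactly the label time is a conservative choice assumed here.

```python
from bisect import bisect_left
from typing import Any, List, Optional, Tuple


def feature_as_of(history: List[Tuple[float, Any]],
                  label_time: float) -> Optional[Any]:
    """Return the latest feature value observed strictly before label_time.

    history must be sorted by event time. Values stamped at or after
    label_time are excluded so the feature cannot encode the outcome.
    """
    times = [t for t, _ in history]
    # bisect_left gives the index of the first entry with time >= label_time,
    # so everything before that index happened strictly earlier.
    i = bisect_left(times, label_time)
    return history[i - 1][1] if i > 0 else None


history = [(1.0, "bronze"), (5.0, "silver"), (9.0, "gold")]
print(feature_as_of(history, 6.0))  # silver
print(feature_as_of(history, 5.0))  # bronze
print(feature_as_of(history, 1.0))  # None
```

Joining on the latest value instead of the as-of value would give every training row `"gold"`, including rows whose labels were generated before that value existed.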
Privacy is not a side note. Data lineage is security infrastructure: you cannot honor deletion, retention, access control, or audit requirements if you do not know where data flowed. The same lineage that debugs ML regressions also tells you which derived views contain sensitive records.
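If lineage is recorded as edges from each dataset to the views derived from it, finding every copy that a deletion request must reach is a graph traversal. The sketch below assumes lineage is available as a plain dictionary; the dataset names are invented for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_views(lineage: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every view reachable from start by following lineage edges."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: parent dataset -> views derived directly from it.
lineage = {
    "users_db.profiles": ["profile_cache", "user_embeddings", "warehouse.users"],
    "user_embeddings": ["vector_index"],
    "warehouse.users": ["churn_features"],
    "churn_features": ["churn_model_v4"],
}
print(sorted(downstream_views(lineage, "users_db.profiles")))
```

The same traversal answers the debugging question in reverse if the edges are inverted: start from a regressed model and walk back to the sources that fed it.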
The final rule is that every derived view needs one of three stories. It is synchronously maintained under a strong invariant, it is asynchronously maintained with known lag and idempotent updates, or it is periodically recomputed and verified from the source of truth.
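The second story depends on updates being idempotent, so that duplicates and replays are harmless. One common way to get there is a per-key version check, sketched below under the assumption that every change event carries a monotonically increasing `version` for its key. The event shape and `apply_change` are hypothetical.

```python
from typing import Any, Dict


def apply_change(view: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> bool:
    """Apply one change event; return False if it was a duplicate or stale.

    Deletions are stored as tombstones rather than removed, so a late,
    older event cannot resurrect a deleted key.
    """
    key = event["key"]
    current = view.get(key)
    if current is not None and current["version"] >= event["version"]:
        return False  # already applied, or older than what the view holds
    view[key] = {
        "version": event["version"],
        "value": event.get("value"),
        "deleted": event.get("deleted", False),
    }
    return True


events = [
    {"key": "doc-1", "version": 1, "value": "draft"},
    {"key": "doc-1", "version": 2, "value": "final"},
    {"key": "doc-1", "version": 1, "value": "draft"},  # redelivered duplicate
    {"key": "doc-2", "version": 1, "value": "memo"},
    {"key": "doc-2", "version": 2, "deleted": True},   # tombstone
]

view: Dict[str, Dict[str, Any]] = {}
for e in events:
    apply_change(view, e)
snapshot = {k: dict(v) for k, v in view.items()}

for e in events:  # replay the whole log; the view must not change
    apply_change(view, e)
print(view == snapshot)  # True
```

Readers of such a view would need to treat entries with `deleted` set as absent. Tombstones can be compacted once no older events can still arrive.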
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| Synchronous derived update | Fresh and invariant-friendly | Slower writes and cross-system coupling |
| Async CDC update | Decoupled and scalable | Lag, duplicates, and replay handling required |
| Periodic batch rebuild | Simple correctness reset and reproducibility | Stale between runs and expensive at scale |
| Approximate index | Fast semantic/search access | Recall/precision trade-offs and harder correctness semantics |
For each choice, you should be able to name the contract it offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
When derived data becomes untraceable, the system loses the ability to answer the most important questions: why did the user see this, what source produced it, can we delete it, and can we rebuild it correctly?
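The question of whether a view can be rebuilt correctly is testable: recompute it from the source and compare the result with what is being served. The sketch below assumes a hypothetical view that totals order amounts per customer. `rebuild_view` and `diff_views` are illustrative names, and a real check would likely sample keys instead of comparing everything.

```python
from typing import Any, Dict, List, Tuple


def rebuild_view(source_rows: List[Dict[str, Any]]) -> Dict[str, int]:
    """Recompute the hypothetical view: total order amount per customer."""
    totals: Dict[str, int] = {}
    for row in source_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def diff_views(expected: Dict[str, int],
               actual: Dict[str, int]) -> Dict[str, Tuple[Any, Any]]:
    """Map each disagreeing key to (value from rebuild, value being served)."""
    keys = sorted(set(expected) | set(actual))
    return {
        k: (expected.get(k), actual.get(k))
        for k in keys
        if expected.get(k) != actual.get(k)
    }


source_rows = [
    {"customer": "c1", "amount": 10},
    {"customer": "c1", "amount": 5},
    {"customer": "c2", "amount": 7},
]
served = {"c1": 15, "c2": 9, "c3": 1}  # drifted copy of the view

print(diff_views(rebuild_view(source_rows), served))
# {'c2': (7, 9), 'c3': (None, 1)}
```

An empty diff is evidence that the rebuild path still works. A non-empty one names the keys to repair and gives a drift count that can be tracked as a metric.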
Design prompts
- Design a RAG ingestion pipeline with source-of-truth, derived views, lag metrics, and deletion handling.
- Which derived states in a recommendation system can be stale? Which cannot?
- How would you verify that an online feature store matches offline training features?