all_lessons / data_intensive_systems / 15 · stream processing lesson 15 / 16 · ~13 min

Stream Processing: Logs, CDC, Event Time, Windows, and Fault Tolerance

Streams are batch processing without an end-of-file. The hard parts are ordering, time, replay, duplicates, state, and deciding when a window is complete enough.

First principle

A stream processor continuously derives state from an unbounded log. Correctness depends on replayable input, deterministic processing, and explicit time semantics.

producer -> partitioned log -> stream processor -> derived state | | | +--> checkpoints state + offsets +--> replay lets you rebuild after failure

Queues vs logs

Message queues assign work to consumers and delete messages after acknowledgement. They are good for task distribution where each message should be handled once by one worker. Logs retain ordered messages by partition and let consumers track offsets. They are good for replay, multiple consumers, CDC, and derived views.

That distinction matters. A task queue is a work dispatcher. A log is a durable history. Stream processing needs history because consumers fail, code changes, and derived state must be rebuilt. Kafka-style logs act like the filesystem of streaming systems: append, retain, replay.

CDC and event sourcing

Change data capture turns database changes into a stream. Instead of dual-writing to the database and search index, write to the database, then tail the database log and update derived systems. This reduces the dual-write problem because the database commit is the source of truth.

Event sourcing goes further: the event log is the primary record, and current state is derived by replaying events. This is powerful for auditability and reconstruction, but shifts complexity into event design, versioning, and snapshotting.

Both patterns rely on the same idea: an ordered history of facts lets you build and rebuild derived views.

Event time, windows, and fault tolerance

Processing time is when the system sees an event. Event time is when the event actually happened. Late events make these differ. A stream processor must choose window semantics: when do we emit a count for 10:00-10:05 if an event from 10:03 arrives at 10:07?

Watermarks estimate how complete event time is. Allowed lateness defines how long windows remain open for corrections. The trade is freshness versus correctness. Emit early and correct later, or wait and increase latency.

Fault tolerance requires checkpointing both processing state and log offsets. On restart, the processor reloads state and resumes from a known offset. Exactly-once output is usually achieved by idempotent writes, transactions with offsets, or deterministic replay, not by pretending failures never duplicate work.

Trade-offs

ChoiceBuysCosts
Task queueSimple work distribution and backpressureNo natural replay history for many consumers
Partitioned logReplay, multiple consumers, ordered per keyOffset/state management and retention costs
Low-latency windowsFresh resultsLate events cause corrections or missed data
Long lateness allowanceMore complete event-time resultsHigher latency and more state retention
What you can now decide

You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

If a stream is treated like a reliable callback, failures create duplicates, late events corrupt windows, and derived state cannot be rebuilt when a bug is fixed.

Design prompts

  1. Use CDC to keep a search index updated. Where can lag appear?
  2. What is the difference between event time and processing time for feature freshness?
  3. How would you make stream output idempotent?