Stream Processing: Logs, CDC, Event Time, Windows, and Fault Tolerance
Streams are batch processing without an end-of-file. The hard parts are ordering, time, replay, duplicates, state, and deciding when a window is complete enough.
A stream processor continuously derives state from an unbounded log. Correctness depends on replayable input, deterministic processing, and explicit time semantics.
Queues vs logs
Message queues assign work to consumers and delete messages after acknowledgement. They are good for task distribution where each message should be handled once by one worker. Logs retain ordered messages by partition and let consumers track offsets. They are good for replay, multiple consumers, CDC, and derived views.
That distinction matters. A task queue is a work dispatcher. A log is a durable history. Stream processing needs history because consumers fail, code changes, and derived state must be rebuilt. Kafka-style logs act like the filesystem of streaming systems: append, retain, replay.
CDC and event sourcing
Change data capture turns database changes into a stream. Instead of dual-writing to the database and search index, write to the database, then tail the database log and update derived systems. This reduces the dual-write problem because the database commit is the source of truth.
Event sourcing goes further: the event log is the primary record, and current state is derived by replaying events. This is powerful for auditability and reconstruction, but shifts complexity into event design, versioning, and snapshotting.
Both patterns rely on the same idea: an ordered history of facts lets you build and rebuild derived views.
Event time, windows, and fault tolerance
Processing time is when the system sees an event. Event time is when the event actually happened. Late events make these differ. A stream processor must choose window semantics: when do we emit a count for 10:00-10:05 if an event from 10:03 arrives at 10:07?
Watermarks estimate how complete event time is. Allowed lateness defines how long windows remain open for corrections. The trade is freshness versus correctness. Emit early and correct later, or wait and increase latency.
Fault tolerance requires checkpointing both processing state and log offsets. On restart, the processor reloads state and resumes from a known offset. Exactly-once output is usually achieved by idempotent writes, transactions with offsets, or deterministic replay, not by pretending failures never duplicate work.
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| Task queue | Simple work distribution and backpressure | No natural replay history for many consumers |
| Partitioned log | Replay, multiple consumers, ordered per key | Offset/state management and retention costs |
| Low-latency windows | Fresh results | Late events cause corrections or missed data |
| Long lateness allowance | More complete event-time results | Higher latency and more state retention |
You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
If a stream is treated like a reliable callback, failures create duplicates, late events corrupt windows, and derived state cannot be rebuilt when a bug is fixed.
Design prompts
- Use CDC to keep a search index updated. Where can lag appear?
- What is the difference between event time and processing time for feature freshness?
- How would you make stream output idempotent?