41/60 Days System Design Questions

#abotwrotethis #dataengineering #backend #systemdesign

Your data team just got a new SLA: surface fraud signals within 500ms of a transaction.

Right now you're running nightly Spark batch jobs. The business wants "real-time." Your team knows Spark. Someone already opened a PR adding Spark Structured Streaming.

The transaction volume: 8,000 events/sec peak. You're on AWS. The fraud model runs in Python. The output feeds a DynamoDB table the API reads from.

You need to redesign the pipeline. What do you pick?

A) Kafka Streams — event-by-event processing, stateful operators, sub-10ms latency. Lives inside your app JVM.
B) Apache Flink — true streaming engine, exactly-once semantics, built for high-throughput stateful processing.
C) Spark Structured Streaming — micro-batch under the hood, 100ms–5s windows, same API your team already knows.
D) Keep the batch job, drop the window to 1 minute — "near real-time" at zero migration cost.

Three of these can hit sub-500ms. One of them cannot — no matter how you tune it.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

Drop your answer 👇

#30DaysOfSystemDesign #SystemDesign #DataEngineering #DistributedSystems

Top comments (4)

Joud Awad • Jun 16

Why B wins (Apache Flink):
Flink is a true streaming engine: event-by-event, not batch. Every transaction triggers processing the moment it arrives. Latency: sub-50ms end-to-end with proper tuning. At 8K events/sec, Flink doesn't blink. Exactly-once semantics mean your fraud signals aren't double-counted in Flink state when a node fails; to carry that guarantee through to DynamoDB, make the write idempotent (for example, keyed on transaction ID). The Python fraud model integrates via PyFlink or a sidecar microservice. Flink's watermark model handles out-of-order events cleanly, which matters when transactions arrive from distributed payment processors with clock skew.

Operational overhead is real. But the SLA demands it. You're not reaching for Flink prematurely here — the business asked for 500ms and gave you 8K events/sec. That's exactly the problem Flink was built for.
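
To make the watermark point concrete, here is a toy sketch in plain Python. It is not the Flink or PyFlink API; the class name, the 200ms bound, and the event times are all made up for illustration. The idea: the watermark trails the highest event time seen so far by a fixed bound, and anything older than the watermark counts as late.

```python
from dataclasses import dataclass


@dataclass
class ToyWatermark:
    """Illustrative only: watermark = max event time seen - allowed out-of-orderness."""
    max_out_of_orderness_ms: int
    max_event_time_ms: float = float("-inf")

    @property
    def watermark_ms(self) -> float:
        return self.max_event_time_ms - self.max_out_of_orderness_ms

    def is_late(self, event_time_ms: int) -> bool:
        # Late means the event is older than the current watermark.
        return event_time_ms < self.watermark_ms

    def observe(self, event_time_ms: int) -> None:
        # The watermark only moves forward, driven by the newest event time.
        self.max_event_time_ms = max(self.max_event_time_ms, event_time_ms)


def classify(event_times_ms, max_out_of_orderness_ms):
    """Return one late/not-late flag per event, in arrival order."""
    wm = ToyWatermark(max_out_of_orderness_ms)
    flags = []
    for t in event_times_ms:
        flags.append(wm.is_late(t))
        wm.observe(t)
    return flags


# Hypothetical event times (ms), arriving out of order.
late_flags = classify([1000, 1300, 1150, 1050], max_out_of_orderness_ms=200)
print(late_flags)  # [False, False, False, True]
```

The event at 1150 arrives after 1300 but is still inside the 200ms bound, so it is processed normally. The event at 1050 falls behind the watermark (1100) and would need late-event handling.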

Joud Awad • Jun 16

Why A is close but not the cleanest fit (Kafka Streams):
Kafka Streams is event-by-event too — sub-10ms achievable. The constraint: it runs inside your application JVM. Your fraud model is Python. That means Kafka Streams handles the stream topology, but the model scoring lives in a separate Python service, adding a network hop. Flink handles that boundary more cleanly with PyFlink. Kafka Streams is the right call if the fraud logic is stateful JVM-native aggregation. Wrong default when the model is a Python service.
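
A rough way to see what the extra hop costs is to add up a latency budget. The sketch below is plain arithmetic, not a benchmark. Only the 500ms SLA and the sub-10ms figure come from this thread; the other stage numbers are placeholders to swap for your own measured p99s.

```python
SLA_MS = 500  # from the post

# Hypothetical per-stage p99 latencies in milliseconds. Replace with measurements.
STAGES_MS = {
    "kafka_streams_topology": 10,  # "sub-10ms" from the comment above
    "network_hop_to_python": 5,    # assumption
    "model_scoring": 50,           # assumption
    "dynamodb_write": 10,          # assumption
}


def remaining_budget_ms(sla_ms, stages_ms):
    """Headroom left after summing every stage; negative means the SLA is blown."""
    return sla_ms - sum(stages_ms.values())


headroom = remaining_budget_ms(SLA_MS, STAGES_MS)
print(headroom)  # 425
```

With these placeholder numbers the hop fits comfortably, which is why A is close. The cost is less about raw milliseconds and more about owning a second service on the hot path.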

Joud Awad • Jun 16

Why C is the senior engineer trap (Spark Structured Streaming):
The PR is already open. The API looks familiar. But Structured Streaming is micro-batch — it collects events for a trigger interval (practically 100ms–5s under production load), then processes them as a mini-batch. You'll tune it, it'll pass staging, and it'll breach SLA the first time peak traffic hits. Mistaking "Streaming" in the name for actual event-by-event processing is the trap.
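
A simplified model of why micro-batching eats the budget, in plain Python rather than Spark code. The function names and the batch processing times are assumptions for illustration; only the 8,000 events/sec peak and the 500ms SLA come from the post. An event that arrives just after a batch starts has to wait for the next batch to begin and then for that batch to finish.

```python
PEAK_EVENTS_PER_SEC = 8_000  # from the post
SLA_MS = 500                 # from the post


def events_per_batch(trigger_interval_ms, events_per_sec=PEAK_EVENTS_PER_SEC):
    """How many events pile up while one trigger interval elapses."""
    return events_per_sec * trigger_interval_ms // 1000


def worst_case_latency_ms(trigger_interval_ms, batch_processing_ms):
    """Worst case for an event that just missed a batch.

    If processing overruns the trigger interval, batches run back to back,
    so the effective interval is the processing time itself.
    """
    effective_interval_ms = max(trigger_interval_ms, batch_processing_ms)
    return effective_interval_ms + batch_processing_ms


# Hypothetical processing times: 80ms in staging, 300ms at peak.
staging = worst_case_latency_ms(trigger_interval_ms=100, batch_processing_ms=80)
peak = worst_case_latency_ms(trigger_interval_ms=100, batch_processing_ms=300)

print(events_per_batch(100), events_per_batch(5000))  # 800 40000
print(staging, staging <= SLA_MS)                     # 180 True
print(peak, peak <= SLA_MS)                           # 600 False
```

Same 100ms trigger in both runs. The only thing that changed is how long the batch takes once it is full of peak traffic.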

Joud Awad • Jun 16

Why D is eliminated immediately (batch, 1-minute windows):
Worst-case latency is at least the window size: an event that lands just after a run starts waits the full 60 seconds for the next one. That's 120x over the 500ms SLA before the first line of fraud logic runs. "Near real-time" is not an SLA.
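
The 120x figure is just the window divided by the SLA. The sketch below also estimates how many events could make the SLA at all, under two loud assumptions that are mine, not the post's: arrivals are spread evenly across the window, and the batch itself takes zero time to run.

```python
WINDOW_MS = 60_000  # 1-minute batch window from option D
SLA_MS = 500        # from the post

# How many times over the SLA the full-window wait is.
overshoot = WINDOW_MS / SLA_MS

# Only events arriving in the last SLA_MS before a run can be on time
# (assumes uniform arrivals and zero processing time).
on_time_fraction = SLA_MS / WINDOW_MS

print(overshoot)                   # 120.0
print(f"{on_time_fraction:.2%}")   # 0.83%
```

Even with those generous assumptions, fewer than 1 in 100 transactions would be scored in time.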