Gathuru_M

Posted on Jun 19

Introduction to Apache Kafka: Shifting from Batch Processing to Real-Time Streaming

#data #architecture #beginners #dataengineering

Batch processing is the approach where a large amount of data is collected over a period of time and processed as a single unit, while streaming is the approach where each piece of data is processed individually as it arrives, in near real time.

In batch processing, whether we run Python scripts manually or schedule them with Apache Airflow, the core concept is the same: wait for a period of time, collect a chunk of data (like an hour's worth of news or crypto prices), process it, and load it into a database.
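To make the contrast concrete, here is a minimal sketch in plain Python. It does not use Kafka or Airflow; the price values and both function names are made up for illustration. The batch function needs the whole chunk before it can answer, while the streaming function produces an updated answer after every event.

```python
from typing import Iterable, Iterator

# Hypothetical price ticks; in a real pipeline these would arrive over time.
ticks = [101.0, 102.5, 99.5, 100.0]


def batch_average(collected: list[float]) -> float:
    """Batch style: wait until the whole chunk is collected, then process it once."""
    return sum(collected) / len(collected)


def streaming_average(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the running result as each event arrives."""
    total = 0.0
    for count, price in enumerate(events, start=1):
        total += price
        yield total / count


batch_result = batch_average(ticks)              # one answer, at the end
stream_results = list(streaming_average(ticks))  # one answer per event

print("batch:", batch_result)
print("stream:", stream_results)
```

Both approaches end at the same final value. The difference is that the streaming version already had a usable answer after the first tick.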

Batch processing is recommended for reports, daily dashboards, and historical analysis. But what happens when a business needs answers in milliseconds?

Consider these scenarios:

Ride-Sharing Apps: Uber or Bolt tracking a driver’s GPS coordinates second-by-second to update your ETA.
Financial Security: A bank analyzing a credit card swipe for fraud before the transaction is approved.
E-commerce: Tracking clickstream data (every single scroll, click, and hover a user makes) to instantly update product recommendations on a homepage.

You can't wait an hour for an Airflow DAG to trigger for these use cases. You need Event Streaming, and that is exactly what Apache Kafka is designed to handle.

What is Apache Kafka?

Apache Kafka is a distributed event streaming platform.
Instead of acting like a traditional database where data sits in tables waiting to be queried, Kafka handles data as a continuous, high-speed flow of messages (called events).

It acts as a highly decoupled, fault-tolerant middleman between systems that produce data and systems that need to consume that data.

Core Components of Kafka Architecture

To manage real-time streams at scale, Kafka relies on a few key structural components:

1. Producers and Consumers

Producers: Applications that generate and send data into Kafka. For example, a mobile app sending user location updates, or a microservice publishing transaction details.
Consumers: Applications that subscribe to Kafka to read and process those incoming streams. For example, an analytics system calculating live traffic patterns, or a notification service sending an SMS receipt.

The key point here is that Producers and Consumers are completely independent.
A producer doesn't know or care who is reading its data, which prevents your entire system architecture from becoming a tangled web of direct API connections.
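A rough way to picture this decoupling is a toy, in-memory model. The sketch below is not Kafka's API: `ToyTopic` and `ToyConsumer` are hypothetical names, and a plain Python list stands in for the topic. The point is only that the producer appends without knowing who reads, and each consumer moves through the same data at its own pace.

```python
class ToyTopic:
    """An append-only list standing in for a topic (illustration only)."""

    def __init__(self):
        self.log = []

    def append(self, event):
        # The producer side: add the event and return its position in the log.
        self.log.append(event)
        return len(self.log) - 1


class ToyConsumer:
    """Reads from a topic at its own pace, tracking its own position."""

    def __init__(self, name, topic):
        self.name = name
        self.topic = topic
        self.position = 0  # index of the next event this consumer will read

    def poll(self, max_events=10):
        batch = self.topic.log[self.position:self.position + max_events]
        self.position += len(batch)
        return batch


topic = ToyTopic()
for seq in range(3):
    # The producer never references any consumer.
    topic.append({"driver": "d1", "seq": seq})

analytics = ToyConsumer("analytics", topic)
notifier = ToyConsumer("notifier", topic)

analytics_batch = analytics.poll()             # reads everything available
notifier_batch = notifier.poll(max_events=1)   # reads only the first event

print(analytics.name, analytics_batch)
print(notifier.name, notifier_batch)
```

Reading does not remove anything from the log, so a slow consumer never holds back a fast one.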

2. Topics and Partitions

Data within Kafka is organized into categories called Topics (similar to a table in a traditional database). If you are tracking a logistics fleet, you might have a topic named truck_gps_coordinates.

To handle massive scale, a single Topic is split into multiple pieces called Partitions.

Partitions are spread across different servers (called Brokers).
This allows Kafka to process data in parallel: multiple consumers in the same consumer group can read from different partitions of the same topic simultaneously, maximizing throughput.
Inside a partition, every message is appended in the order it arrives and given a sequential ID called an Offset, which is unique within that partition. Ordering is guaranteed only within a partition, not across the whole topic.
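The sketch below illustrates how messages with a key could be spread across partitions, again as a toy model rather than real Kafka code. The partition count, the `choose_partition` and `send` helpers, and the truck keys are assumptions for illustration. It uses CRC32 as a stand-in hash; Kafka's default partitioner for keyed messages uses a murmur2 hash instead, so real partition assignments will differ.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count for the topic

# Each partition is its own append-only list.
partitions = [[] for _ in range(NUM_PARTITIONS)]


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. CRC32 is a stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def send(key: str, value: str) -> tuple[int, int]:
    """Append to the chosen partition and return (partition, offset)."""
    partition = choose_partition(key)
    partitions[partition].append((key, value))
    offset = len(partitions[partition]) - 1  # offsets count up per partition
    return partition, offset


events = [("truck-7", "a"), ("truck-9", "b"), ("truck-7", "c"), ("truck-7", "d")]
placements = [send(key, value) for key, value in events]

# Every event for the same key lands in the same partition, in order.
truck7 = [place for (key, _), place in zip(events, placements) if key == "truck-7"]
print(truck7)
```

Because the same key always hashes to the same partition, all updates for one truck stay in order relative to each other, even though the topic as a whole is split up.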

3. Offsets: How Kafka Tracks Progress

Unlike a traditional message queue, Kafka doesn't delete messages the moment a consumer reads them. Messages stay in Kafka for a configured amount of time (e.g., 7 days).

Because the data stays put, a consumer uses its offset like a bookmark to remember which message it read last. If the consumer crashes, it checks its last committed offset when it restarts and resumes reading from there. No events are skipped, although any events that were processed after the last commit may be read a second time.
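Here is a small simulation of that bookmark behaviour. It is a simplified sketch, not how a real Kafka client is written: `ToyOffsetStore`, `run_consumer`, and the "commit every two events" rule are assumptions chosen to keep the example short. The committed value is the offset of the next event to read.

```python
log = [f"event-{i}" for i in range(6)]  # a toy partition with offsets 0..5


class ToyOffsetStore:
    """Holds the committed offset, which survives a consumer crash."""

    def __init__(self):
        self.committed = 0  # offset of the next event to read


def run_consumer(log, store, crash_after=None):
    """Process events from the committed offset, committing every two events."""
    processed = []
    position = store.committed
    while position < len(log):
        if crash_after is not None and len(processed) == crash_after:
            return processed  # simulated crash: uncommitted progress is lost
        processed.append(log[position])
        position += 1
        if position % 2 == 0:
            store.committed = position
    store.committed = position  # final commit once the log is drained
    return processed


store = ToyOffsetStore()

first_run = run_consumer(log, store, crash_after=3)
committed_after_crash = store.committed

second_run = run_consumer(log, store)  # the restarted consumer

print("first run: ", first_run)
print("committed: ", committed_after_crash)
print("second run:", second_run)
```

In this run the consumer crashes after processing three events but had only committed up to offset 2, so `event-2` is processed again after the restart. Nothing is skipped, but downstream logic has to tolerate the occasional repeat.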

Local Setup: ZooKeeper and Brokers

When you start practicing with Kafka locally, especially inside a Linux environment like WSL, you quickly realize it takes a bit of infrastructure orchestration to spin up.

Historically, Kafka has relied on Apache ZooKeeper to act as the coordinator: managing the cluster, tracking which brokers are alive, and supporting leader election for partitions. (Newer Kafka releases can instead run in KRaft mode, which removes the ZooKeeper dependency.)

When launching a ZooKeeper-based setup via the CLI, you have to start ZooKeeper first and then launch your Kafka Broker service (the Kafka server).

(Screenshot: ZooKeeper and Kafka Broker services running in the terminal.)
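Once both services are started, a quick sanity check is to see whether anything is listening on their ports. The sketch below uses only Python's standard library. It assumes the default ports (2181 for ZooKeeper, 9092 for the Kafka broker) and a `localhost` setup; adjust both if your configuration differs. The `port_is_open` helper is a name made up for this example, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed default ports; change these if your config files say otherwise.
    services = [("ZooKeeper", 2181), ("Kafka broker", 9092)]
    for name, port in services:
        status = "listening" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {status}")
```

If ZooKeeper shows as not reachable, start it before the broker, since the broker needs it to be running first in this kind of setup.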

Summary

In Batch (Airflow/Postgres): Data is at rest. We run queries over the static data.
In Streaming (Kafka): Data is in motion. Our application logic stays active, and the data constantly flows through our code.

What's Next?

Understanding the architecture of topics, partitions, and offsets is step one. But as data engineers, we need to programmatically interact with this streaming cluster.

In the next article, we are going to write Python scripts using a Kafka client library. We will build a custom Python Producer to generate stream events and a Python Consumer to read and display those events in real-time inside our setup.

Are you moving into real-time streaming workflows yet, or sticking to batch pipelines? Let's discuss in the comments below!

DEV Community