1. Apache Kafka – Complete Guide #
Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events a day. It was originally developed by LinkedIn and later donated to the Apache Software Foundation. Kafka is primarily used for building real-time data pipelines and streaming applications.
2. Core Concepts of Kafka #
2.1 Topics #
- Kafka stores events in categories called topics.
- Each topic is split into partitions to enable parallelism.
- Records written to a topic are immutable; the underlying log is append-only and is read, never modified in place.
2.2 Partitions #
- A topic is divided into partitions, which distribute the load.
- Each partition is an ordered, immutable sequence of records.
- Each record within a partition is assigned a sequential ID called an offset, which uniquely identifies it inside that partition.
2.3 Producers #
- Producers are applications that publish events to one or more Kafka topics.
- Producers decide which partition an event goes to: keyed records are hashed to a partition, keyless records are distributed round-robin (or via a sticky strategy in newer clients), and a custom partitioner can override either, as sketched below.
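The following sketch uses the kafka-python client (the topic name and broker address are assumptions) to show how keys steer partitioning: records sharing a key always land in the same partition, which preserves their relative order.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Records with the same key are hashed to the same partition,
# so per-key ordering is preserved.
producer.send('my-topic', key=b'user-42', value=b'logged_in')
producer.send('my-topic', key=b'user-42', value=b'added_to_cart')

# A partition can also be chosen explicitly.
producer.send('my-topic', value=b'audit-event', partition=0)

producer.flush()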
2.4 Consumers #
- Consumers subscribe to one or more topics and process the published events.
- Consumer groups allow multiple consumers to read from a topic in parallel while ensuring each partition is processed by only one consumer in the group, as sketched below.
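A minimal kafka-python sketch (the group name is illustrative): starting several copies of this process makes Kafka split the topic's partitions among them.
from kafka import KafkaConsumer

# Consumers that share a group_id divide the topic's partitions among
# themselves; each partition is read by exactly one group member.
consumer = KafkaConsumer(
    'my-topic',
    group_id='order-processors',  # illustrative group name
    bootstrap_servers='localhost:9092'
)

for message in consumer:
    print(message.partition, message.offset, message.value)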
2.5 Brokers #
- A Kafka broker is a server that stores topic partitions and serves client requests.
- Kafka clusters usually consist of multiple brokers for scalability and fault tolerance.
2.6 ZooKeeper #
- ZooKeeper manages metadata and coordination for Kafka brokers.
- Newer Kafka versions replace ZooKeeper with the KRaft (Kafka Raft) protocol, eliminating the external dependency; a KRaft startup sketch follows.
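In KRaft mode, the storage directory is formatted once with a cluster ID and the broker is started against the sample KRaft config shipped with Kafka 3.x (paths assume an unmodified distribution):
# One-time step: generate a cluster ID and format the log directories
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties

# Start the broker in KRaft mode -- no ZooKeeper process required
bin/kafka-server-start.sh config/kraft/server.properties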
3. Kafka Architecture #
Kafka’s architecture comprises producers, brokers, consumers, and ZooKeeper (or KRaft). Key features include:
- Scalability: Easily add brokers and partitions.
- Durability: Messages are persisted to disk and replicated.
- Fault Tolerance: If a broker fails, leadership for its partitions moves to in-sync replicas automatically.
4. Key Kafka Operations #
4.1 Publishing Messages #
- Producers send messages to a specific topic.
- Messages can include keys; all messages with the same key go to the same partition, which preserves their order (see the console example below).
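For quick manual testing, the console producer can send keyed messages too; parse.key and key.separator are standard console-producer properties (the separator character is a matter of choice):
bin/kafka-console-producer.sh --topic my-topic --bootstrap-server localhost:9092 \
  --property parse.key=true --property key.separator=:
# Then enter messages as key:value pairs, e.g.
#   user-42:logged_in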
4.2 Consuming Messages #
- Consumers poll Kafka for messages.
- Kafka provides at-least-once delivery by default: if a consumer crashes before committing its offset, the messages are redelivered on restart, so duplicates are possible and processing should be idempotent (see the manual-commit sketch below).
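A sketch of at-least-once consumption with kafka-python: offsets are committed only after processing succeeds, so a crash mid-record causes redelivery rather than data loss (process_record is a hypothetical application handler).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'my-topic',
    group_id='my-group',
    bootstrap_servers='localhost:9092',
    enable_auto_commit=False  # commit manually, after processing
)

for message in consumer:
    process_record(message.value)  # hypothetical application logic
    # Committing only after processing gives at-least-once semantics:
    # a crash before this line means the record is delivered again.
    consumer.commit()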
4.3 Message Retention #
- Kafka retains messages for a configurable period (e.g., 7 days).
- Retention can be configured based on time or log size, per topic or broker-wide (see the example below).
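For example, limiting my-topic to seven days of data with the stock admin script (604800000 ms = 7 days):
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.ms=604800000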
5. Kafka Use Cases #
- Real-time analytics: Processing logs and metrics in real-time.
- Event sourcing: Capturing and storing all changes to an application state.
- Log aggregation: Centralized log collection for analysis.
- Stream processing: Enabling applications to process streams of data in real-time.
6. Kafka Setup and Configuration #
6.1 Prerequisites #
- Java installed (version 8 or higher).
- Sufficient disk space for logs.
6.2 Steps to Set Up Kafka #
- Download Kafka
wget https://downloads.apache.org/kafka/3.5.1/kafka_2.13-3.5.1.tgz
- Extract Kafka
tar -xvzf kafka_2.13-3.5.1.tgz
cd kafka_2.13-3.5.1
- Start ZooKeeper (skip this step when running in KRaft mode)
bin/zookeeper-server-start.sh config/zookeeper.properties
- Start Kafka Broker
bin/kafka-server-start.sh config/server.properties
7. Kafka Commands #
7.1 Create a Topic #
bin/kafka-topics.sh --create --topic my-topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
7.2 List Topics #
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
7.3 Produce Messages #
bin/kafka-console-producer.sh --topic my-topic --bootstrap-server localhost:9092
7.4 Consume Messages #
bin/kafka-console-consumer.sh --topic my-topic --from-beginning --bootstrap-server localhost:9092
8. Kafka APIs #
- Producer API: Enables applications to send streams of data to topics.
- Consumer API: Allows applications to read data from topics.
- Streams API: Transforms input streams to output streams.
- Connect API: Integrates Kafka with other systems via connectors.
9. Advanced Kafka Concepts #
9.1 Kafka Streams #
- A lightweight library for building stream processing applications.
- Supports stateless and stateful operations, windowing, and joins (see the sketch below).
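Kafka Streams is a Java library; a minimal stateless topology (the topic names are assumptions) that uppercases every record from one topic into another:
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

StreamsBuilder builder = new StreamsBuilder();
// Read from input-topic, transform each value, write to output-topic
builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
       .mapValues(value -> value.toUpperCase())
       .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();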
9.2 Kafka Connect #
- Used for integrating Kafka with external systems like databases and file systems.
- Pre-built connectors are available for quick integration; a standalone-mode example follows.
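A common first test is the file source connector bundled with Kafka, run in standalone mode (depending on the Kafka version, the file connector jar may first need to be added to plugin.path):
# Streams lines appended to the configured input file into a topic,
# using the sample property files shipped in config/
bin/connect-standalone.sh config/connect-standalone.properties \
  config/connect-file-source.properties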
9.3 Partition Rebalancing #
- Within a consumer group, rebalancing reassigns the topic's partitions among members whenever a consumer joins, leaves, or fails.
- Kafka manages group rebalances automatically; moving partition replicas between brokers is a separate, administrator-triggered reassignment.
10. Kafka Monitoring and Metrics #
10.1 Key Metrics #
- Broker performance: Latency and throughput.
- Consumer lag: The gap between a partition's latest offset and the consumer's committed offset, i.e., how far the consumer is behind the head of the log (see the command below).
- Partition health: Ensures all partitions are assigned and replicated.
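Lag can be inspected with the stock consumer-groups script (the group name matches the Java example later in this guide):
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group test-group
# The LAG column shows, per partition, how many records the group
# still has to catch up on.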
10.2 Monitoring Tools #
- Kafka Manager (now CMAK): Provides a web-based interface for managing clusters.
- Prometheus and Grafana: For detailed metric visualization.
11. Kafka Security #
- Authentication: SASL and SSL for secure connections.
- Authorization: ACLs to control access to topics and consumer groups (example below).
- Encryption: SSL/TLS to encrypt data in transit.
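As an illustration (the principal is hypothetical, and the broker must have an authorizer configured), granting a user read access to a topic with the stock ACL tool:
bin/kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:alice --operation Read --topic my-topic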
12. Hands-On Code Examples #
12.1 Python Producer Example #
from kafka import KafkaProducer
import json

# Serialize Python dicts to UTF-8 JSON before sending
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Send a message to the Kafka topic
message = {"key": "value"}
producer.send('my-topic', value=message)
producer.flush()  # block until buffered messages are actually delivered
print("Message sent successfully.")
12.2 Python Consumer Example #
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'my-topic',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest',  # start from the beginning if no committed offset exists
    enable_auto_commit=True,       # commit offsets automatically in the background
    value_deserializer=lambda v: json.loads(v.decode('utf-8'))
)

# Iterating over the consumer polls Kafka and blocks until records arrive
for message in consumer:
    print(f"Received message: {message.value}")
12.3 Java Consumer with Polling #
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test-group");
props.put("enable.auto.commit", "true");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("my-topic"));

// Poll in a loop; each call returns the records received since the last poll
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n",
                record.offset(), record.key(), record.value());
    }
}
13. Conclusion #
Apache Kafka is a robust, scalable, and fault-tolerant platform for managing real-time event streams. Its versatile architecture and rich ecosystem make it a vital tool in modern data processing workflows.