
1. Apache Kafka – Complete Guide #

Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events a day. It was originally developed by LinkedIn and later donated to the Apache Software Foundation. Kafka is primarily used for building real-time data pipelines and streaming applications.

2. Core Concepts of Kafka #

2.1 Topics #

Kafka stores events in categories called topics.
Each topic is split into partitions to enable parallelism.
A topic is an append-only log: records are immutable once written, and new records are only ever added at the end.

2.2 Partitions #

A topic is divided into partitions, which distribute the load.
Each partition is an ordered, immutable sequence of records.
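Each record in a partition is identified by a sequential offset; ordering is guaranteed only within a partition, not across the whole topic.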
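As a rough illustration of the offset idea, the sketch below models a single partition as an in-memory, append-only list. The `PartitionLog` class and its method names are hypothetical and exist only for this example; a real broker stores partitions as segment files on disk.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartitionLog:
    """Toy in-memory model of one partition: an append-only list of records."""
    records: List[bytes] = field(default_factory=list)

    def append(self, value: bytes) -> int:
        """Append a record and return the offset assigned to it."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Tuple[int, bytes]]:
        """Return (offset, value) pairs starting at the given offset."""
        chunk = self.records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
first = log.append(b"order-created")
second = log.append(b"order-paid")
print(first, second)   # 0 1
print(log.read(1))     # [(1, b'order-paid')]
```

Reading never removes anything from the log, which is why several consumers can read the same partition independently, each tracking its own offset.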

2.3 Producers #

Producers are applications that publish events to one or more Kafka topics.
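Producers decide which partition an event goes to: by default a keyed event is routed by a hash of its key, a keyless event is spread across partitions (round-robin or, in newer clients, sticky batching), and a custom partitioner can override both.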
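The partition choice can be sketched in a few lines. This is a simplified model, not the client's real code: Kafka's Java client hashes keys with murmur2, whereas this sketch uses `zlib.crc32` as a stand-in, and the function name `choose_partition` is hypothetical.

```python
import zlib
from itertools import count
from typing import Optional

# Shared counter used to rotate keyless records across partitions.
_round_robin = count()


def choose_partition(key: Optional[bytes], num_partitions: int) -> int:
    """Pick a partition: hash the key if present, otherwise rotate."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    if key is None:
        return next(_round_robin) % num_partitions
    # Stand-in hash; the real Java client uses murmur2 on the key bytes.
    return zlib.crc32(key) % num_partitions


# The same key always maps to the same partition.
print(choose_partition(b"user-42", 3) == choose_partition(b"user-42", 3))  # True
# Keyless records cycle through the partitions.
print([choose_partition(None, 3) for _ in range(3)])
```

Because the mapping depends on the partition count, adding partitions to a topic changes where existing keys land.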

2.4 Consumers #

Consumers subscribe to one or more topics and process the published events.
Consumer groups allow multiple consumers to read from a topic in parallel: within a group, each partition is assigned to exactly one consumer, so the partition count caps the group's useful parallelism.
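The one-consumer-per-partition rule can be illustrated with a simplified round-robin assignment. This is only a sketch: the function name `assign_partitions` is hypothetical, and Kafka's built-in assignors (range, round-robin, sticky) are more involved.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread the partitions over the consumers of one group, round-robin."""
    if not consumers:
        return {}
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


print(assign_partitions([0, 1, 2], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1]}
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}
```

The second call shows why adding consumers beyond the number of partitions does not help: the extra consumer receives nothing and sits idle.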

2.5 Brokers #

A Kafka broker is a server that stores topic partitions and serves client requests.
Kafka clusters usually consist of multiple brokers for scalability and fault tolerance.

2.6 Zookeeper #

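ZooKeeper stores cluster metadata and coordinates Kafka brokers, including controller election.
Newer versions replace ZooKeeper with KRaft (Kafka Raft), a built-in consensus protocol: KRaft has been production-ready since Kafka 3.3, and Kafka 4.0 removes ZooKeeper support entirely.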

3. Kafka Architecture #

Kafka’s architecture comprises producers, brokers, consumers, and Zookeeper (or KRaft). Key features include:

Scalability: Easily add brokers and partitions.
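Durability: Messages are persisted to disk and replicated across brokers according to the topic's replication factor.
Fault Tolerance: If a broker fails, leadership of its partitions moves to an in-sync replica on another broker.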

4. Key Kafka Operations #

4.1 Publishing Messages #

Producers send messages to a specific topic.
Messages can include a key; with the default partitioner, messages with the same key go to the same partition, which preserves their relative order.

4.2 Consuming Messages #

Consumers poll Kafka for messages.
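When offsets are committed only after messages are processed, Kafka provides at-least-once delivery: a consumer that crashes resumes from its last committed offset, so no message is skipped, but some may be processed twice.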
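The effect of a crash between processing and committing can be simulated without a broker. The sketch below is a toy model under stated assumptions: the partition is a plain list, `consume` is a hypothetical helper, and the "crash" is just an early return before the commit.

```python
from typing import List, Optional, Tuple


def consume(log: List[str], committed: int,
            crash_after: Optional[int] = None) -> Tuple[List[str], int]:
    """Process records from the committed offset, committing after each one.

    Returns (processed_records, new_committed_offset). If crash_after is set,
    the consumer "crashes" after processing that offset but before committing.
    """
    processed = []
    for offset in range(committed, len(log)):
        processed.append(log[offset])    # the processing side effect comes first
        if crash_after is not None and offset == crash_after:
            return processed, committed  # crash: the commit for this offset is lost
        committed = offset + 1           # committed offset = next offset to read
    return processed, committed


log = ["a", "b", "c"]
run1, committed = consume(log, committed=0, crash_after=1)
run2, committed = consume(log, committed=committed)
print(run1, run2, committed)  # ['a', 'b'] ['b', 'c'] 3
```

Record "b" is processed in both runs, so downstream processing should be idempotent or deduplicate by key.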

4.3 Message Retention #

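Kafka retains messages for a configurable period (7 days by default), whether or not they have been consumed.
Retention can be bounded by time (retention.ms), by log size per partition (retention.bytes), or both; whichever limit is reached first triggers deletion of the oldest log segments.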
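To make the two retention modes concrete, the following sketch decides which log segments would be eligible for deletion. It is a simplification: the `Segment` tuple and both helper functions are hypothetical, and a real broker applies extra rules (for example, it never deletes the active segment).

```python
from typing import List, NamedTuple

RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # 7 days expressed in milliseconds


class Segment(NamedTuple):
    base_offset: int        # offset of the first record in the segment
    last_timestamp_ms: int  # timestamp of the newest record in the segment
    size_bytes: int


def expired_by_time(segments: List[Segment], now_ms: int,
                    retention_ms: int = RETENTION_MS) -> List[Segment]:
    """Segments whose newest record is older than the retention period."""
    return [s for s in segments if now_ms - s.last_timestamp_ms > retention_ms]


def expired_by_size(segments: List[Segment], retention_bytes: int) -> List[Segment]:
    """Oldest segments to drop so the remaining log fits in retention_bytes."""
    total = sum(s.size_bytes for s in segments)
    doomed = []
    for s in sorted(segments, key=lambda seg: seg.base_offset):
        if total <= retention_bytes:
            break
        doomed.append(s)
        total -= s.size_bytes
    return doomed


DAY_MS = 24 * 60 * 60 * 1000
segments = [Segment(0, 0, 100), Segment(100, 5 * DAY_MS, 100), Segment(200, 9 * DAY_MS, 100)]
print(expired_by_time(segments, now_ms=10 * DAY_MS))   # only the segment at base offset 0
print(expired_by_size(segments, retention_bytes=150))  # the two oldest segments
```

Retention works on whole segments rather than individual messages, so a message can outlive its nominal retention period until its segment becomes eligible.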

5. Kafka Use Cases #

Real-time analytics: Processing logs and metrics in real-time.
Event sourcing: Capturing and storing all changes to an application state.
Log aggregation: Centralized log collection for analysis.
Stream processing: Enabling applications to process streams of data in real-time.

6. Kafka Setup and Configuration #

6.1 Prerequisites #

Java installed (version 8 or higher).
Sufficient disk space for logs.

6.2 Steps to Set Up Kafka #

Download Kafka:
wget https://downloads.apache.org/kafka/3.5.1/kafka_2.13-3.5.1.tgz

Extract Kafka:
tar -xvzf kafka_2.13-3.5.1.tgz
cd kafka_2.13-3.5.1

Start ZooKeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka Broker (in a second terminal):
bin/kafka-server-start.sh config/server.properties

7. Kafka Commands #

7.1 Create a Topic #

bin/kafka-topics.sh --create --topic my-topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

7.2 List Topics #

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

7.3 Produce Messages #

bin/kafka-console-producer.sh --topic my-topic --bootstrap-server localhost:9092

7.4 Consume Messages #

bin/kafka-console-consumer.sh --topic my-topic --from-beginning --bootstrap-server localhost:9092

8. Kafka APIs #

Producer API: Enables applications to send streams of data to topics.
Consumer API: Allows applications to read data from topics.
Streams API: Transforms input streams to output streams.
Connect API: Integrates Kafka with other systems via connectors.

9. Advanced Kafka Concepts #

9.1 Kafka Streams #

A lightweight library for building stream processing applications.
Supports stateless and stateful operations, windowing, and joins.
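Kafka Streams itself is a Java library, so the sketch below only illustrates the idea of a windowed, stateful aggregation in plain Python. It counts events per key in fixed-size (tumbling) windows; the function name and the sample events are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(events: Iterable[Tuple[int, str]],
                           window_ms: int) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key) using non-overlapping windows."""
    counts: Counter = Counter()
    for timestamp_ms, key in events:
        # Align the timestamp down to the start of its window.
        window_start = timestamp_ms - (timestamp_ms % window_ms)
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(1_000, "page-a"), (4_000, "page-a"), (6_000, "page-a"), (7_500, "page-b")]
print(tumbling_window_counts(events, window_ms=5_000))
# {(0, 'page-a'): 2, (5000, 'page-a'): 1, (5000, 'page-b'): 1}
```

The running counts are the "state" in a stateful operation; a stateless operation such as a filter or map would need no such bookkeeping.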

9.2 Kafka Connect #

Used for integrating Kafka with external systems like databases and file systems.
Pre-built connectors available for quick integration.

9.3 Partition Rebalancing #

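A consumer group rebalance reassigns partitions among the group's consumers when a consumer joins, leaves, or fails; Kafka's group coordinator handles this automatically.
Moving partition replicas between brokers is a separate operation that an operator triggers, for example with bin/kafka-reassign-partitions.sh.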
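What a consumer-side rebalance changes can be seen by comparing assignments before and after a consumer leaves. This is a toy model assuming a naive round-robin strategy; the helper names `assign` and `moved_partitions` are hypothetical.

```python
from typing import Dict, Iterable, List


def assign(partitions: Iterable[int], consumers: List[str]) -> Dict[int, str]:
    """Map each partition to one consumer of the group, round-robin."""
    members = sorted(consumers)
    return {p: members[i % len(members)] for i, p in enumerate(sorted(partitions))}


def moved_partitions(before: Dict[int, str], after: Dict[int, str]) -> List[int]:
    """Partitions whose owner changed between two assignments."""
    return sorted(p for p in before if before[p] != after.get(p))


before = assign(range(4), ["c1", "c2", "c3"])
after = assign(range(4), ["c1", "c2"])  # consumer c3 has left the group
print(before)                           # {0: 'c1', 1: 'c2', 2: 'c3', 3: 'c1'}
print(after)                            # {0: 'c1', 1: 'c2', 2: 'c1', 3: 'c2'}
print(moved_partitions(before, after))  # [2, 3]
```

Only partition 2 lost its owner, yet partition 3 also moved. Sticky and cooperative assignment strategies exist to reduce this kind of unnecessary movement.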

10. Kafka Monitoring and Metrics #

10.1 Key Metrics #

Broker performance: Latency and throughput.
Consumer lag: The gap between a partition's latest offset (log end offset) and the consumer group's committed offset.
Partition health: Ensures all partitions are assigned and replicated.
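Consumer lag reduces to a subtraction per partition. The sketch below assumes the two offset maps have already been fetched (for example from a monitoring tool); the function name and the sample numbers are hypothetical, and treating a missing committed offset as 0 is a simplifying assumption.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset minus the group's committed offset."""
    # Assumption: a partition with no committed offset is counted from offset 0.
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


end_offsets = {0: 120, 1: 80, 2: 45}  # latest offset per partition
committed = {0: 100, 1: 80}           # the group has not committed on partition 2
lag = consumer_lag(end_offsets, committed)
print(lag)                # {0: 20, 1: 0, 2: 45}
print(sum(lag.values()))  # 65
```

A lag that keeps growing over time usually means the consumers cannot keep up with the producers.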

10.2 Monitoring Tools #

Kafka Manager (now CMAK): Provides a web-based interface for managing clusters.
Prometheus and Grafana: For detailed metric visualization.

11. Kafka Security #

Authentication: SASL (for example PLAIN, SCRAM, or Kerberos) or mutual TLS (SSL client certificates) to verify client identity.
Authorization: ACLs to control access to topics and groups.
Encryption: SSL/TLS to encrypt data in transit.
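On the client side, these settings usually end up as connection options. The sketch below builds the keyword arguments a kafka-python client could take for SASL over TLS. The host name, port, CA file path, environment variable names, and fallback credentials are all placeholders, and the broker must be configured with a matching listener.

```python
import os

# Keyword arguments for a kafka-python KafkaProducer or KafkaConsumer.
secure_client_config = {
    "bootstrap_servers": "broker.example.com:9093",  # placeholder host and port
    "security_protocol": "SASL_SSL",                 # SASL authentication over TLS
    "ssl_cafile": "/etc/kafka/ca.pem",               # placeholder path to the CA certificate
    "sasl_mechanism": "PLAIN",
    # Read credentials from the environment rather than hard-coding them.
    "sasl_plain_username": os.environ.get("KAFKA_USERNAME", "demo-user"),
    "sasl_plain_password": os.environ.get("KAFKA_PASSWORD", "demo-password"),
}

# With kafka-python installed and a secured broker available:
# from kafka import KafkaProducer
# producer = KafkaProducer(**secure_client_config)
print(sorted(secure_client_config))
```

Authorization is enforced on the broker, so the authenticated user still needs ACLs for the topics and groups it uses.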

12. Hands-On Code Examples #

12.1 Python Producer Example #

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

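# Send a message to the Kafka topic; send() is asynchronous and returns a future
message = {"key": "value"}
future = producer.send('my-topic', value=message)
# Block until the broker acknowledges the record (raises an exception on failure)
metadata = future.get(timeout=10)
print(f"Message sent to partition {metadata.partition} at offset {metadata.offset}.")
producer.close()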

12.2 Python Consumer Example #

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'my-topic',
    bootstrap_servers=['localhost:9092'],
    group_id='my-group',  # illustrative name; kafka-python commits offsets only when a group_id is set
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    value_deserializer=lambda v: json.loads(v.decode('utf-8'))
)

for message in consumer:
    print(f"Received message: {message.value}")

12.3 Java Consumer with Polling #

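import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test-group");
        props.put("enable.auto.commit", "true");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // try-with-resources closes the consumer if the loop exits with an exception
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("my-topic"));
            while (true) {
                // Wait up to 100 ms for new records before returning an empty batch
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }
}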

13. Conclusion #

Apache Kafka is a robust, scalable, and fault-tolerant platform for managing real-time event streams. Its versatile architecture and rich ecosystem make it a vital tool in modern data processing workflows.
