Bigdata – Knowledge Base

Apache – Kafka

1. Apache Kafka – Complete Guide #

Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events a day. It was originally developed by LinkedIn and later donated to the Apache Software Foundation. Kafka is primarily used for building real-time data pipelines and streaming applications.


2. Core Concepts of Kafka #

2.1 Topics #

  • Kafka stores events in categories called topics.
  • Each topic is split into partitions to enable parallelism.
  • Topics are immutable and append-only.

2.2 Partitions #

  • A topic is divided into partitions, which distribute the load.
  • Each partition is an ordered, immutable sequence of records.
  • Partitions are identified by an offset.

2.3 Producers #

  • Producers are applications that publish events to one or more Kafka topics.
  • Producers decide which partition an event should go to, using either a round-robin approach or a custom partitioner.

2.4 Consumers #

  • Consumers subscribe to one or more topics and process the published events.
  • Consumer groups allow multiple consumers to read from a topic in parallel while ensuring each partition is processed by only one consumer.

2.5 Brokers #

  • A Kafka broker is a server that stores topic partitions and serves client requests.
  • Kafka clusters usually consist of multiple brokers for scalability and fault tolerance.

2.6 Zookeeper #

  • Zookeeper manages metadata and coordination for Kafka brokers.
  • In newer versions, Kafka has moved towards using KRaft (Kafka Raft) protocol to eliminate the dependency on Zookeeper.

3. Kafka Architecture #

Kafka’s architecture comprises producers, brokers, consumers, and Zookeeper (or KRaft). Key features include:

  • Scalability: Easily add brokers and partitions.
  • Durability: Messages are persisted to disk and replicated.
  • Fault Tolerance: Automatic recovery in case of broker failures.

4. Key Kafka Operations #

4.1 Publishing Messages #

  • Producers send messages to a specific topic.
  • Messages can include keys, which determine the partition.

4.2 Consuming Messages #

  • Consumers poll Kafka for messages.
  • Kafka’s at-least-once delivery ensures message delivery even if the consumer crashes.

4.3 Message Retention #

  • Kafka retains messages for a configurable period (e.g., 7 days).
  • Retention can be configured based on time or log size.

5. Kafka Use Cases #

  1. Real-time analytics: Processing logs and metrics in real-time.
  2. Event sourcing: Capturing and storing all changes to an application state.
  3. Log aggregation: Centralized log collection for analysis.
  4. Stream processing: Enabling applications to process streams of data in real-time.

6. Kafka Setup and Configuration #

6.1 Prerequisites #

  • Java installed (version 8 or higher).
  • Sufficient disk space for logs.

6.2 Steps to Set Up Kafka #

  1. Download Kafkawget https://downloads.apache.org/kafka/3.5.1/kafka_2.13-3.5.1.tgz
  2. Extract Kafkatar -xvzf kafka_2.13-3.5.1.tgz cd kafka_2.13-3.5.1
  3. Start Zookeeperbin/zookeeper-server-start.sh config/zookeeper.properties
  4. Start Kafka Brokerbin/kafka-server-start.sh config/server.properties

7. Kafka Commands #

7.1 Create a Topic #

7.2 List Topics #

7.3 Produce Messages #

7.4 Consume Messages #


8. Kafka APIs #

  1. Producer API: Enables applications to send streams of data to topics.
  2. Consumer API: Allows applications to read data from topics.
  3. Streams API: Transforms input streams to output streams.
  4. Connect API: Integrates Kafka with other systems via connectors.

9. Advanced Kafka Concepts #

9.1 Kafka Streams #

  • A lightweight library for building stream processing applications.
  • Supports stateless and stateful operations, windowing, and joins.

9.2 Kafka Connect #

  • Used for integrating Kafka with external systems like databases and file systems.
  • Pre-built connectors available for quick integration.

9.3 Partition Rebalancing #

  • Occurs when partitions are reassigned to different brokers or consumers.
  • Managed automatically by Kafka.

10. Kafka Monitoring and Metrics #

10.1 Key Metrics #

  • Broker performance: Latency and throughput.
  • Consumer lag: Measures how far a consumer is behind the producer.
  • Partition health: Ensures all partitions are assigned and replicated.

10.2 Monitoring Tools #

  • Kafka Manager: Provides a web-based interface.
  • Prometheus and Grafana: For detailed metric visualization.

11. Kafka Security #

  1. Authentication: SASL and SSL for secure connections.
  2. Authorization: ACLs to control access to topics and groups.
  3. Encryption: SSL/TLS to encrypt data in transit.

12. Hands-On Code Examples #

12.1 Python Producer Example #

12.2 Python Consumer Example #

12.3 Java Consumer with Polling #


13. Conclusion #

Apache Kafka is a robust, scalable, and fault-tolerant platform for managing real-time event streams. Its versatile architecture and rich ecosystem make it a vital tool in modern data processing workflows.


What are your feelings
Updated on December 30, 2024