What is Apache Kafka?
Apache Kafka is a distributed event-streaming platform designed to handle real-time data feeds. It allows applications to publish, process, and subscribe to streams of data in a highly scalable, fault-tolerant manner.
Core Concepts in Kafka
- Topics:
  - Kafka organizes data into topics, which are similar to tables in a database.
  - A topic is divided into partitions, which enable parallel processing.
- Producers:
  - Producers publish data (events/messages) to Kafka topics.
  - Data is written to partitions based on a partitioning key.
- Consumers:
  - Consumers read data from topics.
  - Consumers are part of consumer groups, ensuring load balancing.
- Brokers:
  - Brokers are the servers that store and serve Kafka data.
  - Kafka clusters usually have multiple brokers for scalability and fault tolerance.
- ZooKeeper/KRaft:
  - ZooKeeper (or the newer KRaft mode) manages metadata and coordinates the cluster.
How Kafka Works
- Producers send messages to Kafka topics. Messages are stored in partitions within the topics.
- Consumers fetch messages from these partitions. Kafka retains messages for a configurable retention period.
- Data is stored in a log-based structure, ensuring high write and read throughput.
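A minimal sketch of this produce/consume flow, assuming the kafka-python client, a broker reachable at localhost:9092, and a topic named `events` (both are placeholders):

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish one message to the "events" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello, kafka")
producer.flush()  # block until the broker acknowledges the message

# Read messages back. Because Kafka retains messages for the configured
# retention period, a consumer started later still sees this data.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest retained message
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```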
Stream processing involves handling and analyzing data in real time as it flows through systems. This differs from batch processing, where data is processed in chunks at fixed intervals.
Example usage: HDFS to Kafka
Steps to Load HDFS Data into Kafka:
- Prepare Data: Split large HDFS files into smaller chunks if necessary, as Kafka messages have a size limit (default: 1 MB).
- Producer Tool: Use tools like Kafka Connect (a pluggable data integration framework) or write a custom producer script in Python, Java, etc. (a sketch follows this list).
- Configuration:
  - Define Kafka topic(s) for the data.
  - Set partitioning logic (e.g., based on a key or round-robin).
- Test and Monitor: Check data flow using tools like the Kafka CLI or monitoring dashboards (e.g., Kafka Manager).
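A rough sketch of the custom-producer option, assuming the kafka-python client, a chunk of an HDFS file already copied to the local path part-00000, and a topic named hdfs-events (the path and topic name are hypothetical):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Stream the chunk line by line; each line becomes one Kafka message,
# which keeps individual messages well under the 1 MB default limit.
with open("part-00000", "rb") as chunk:
    for line in chunk:
        producer.send("hdfs-events", value=line.rstrip(b"\n"))

producer.flush()  # wait until all buffered messages are acknowledged
```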
Brokers
- A broker is a Kafka server that stores data and serves client requests (from producers and consumers).
- Kafka is a distributed system, so it typically consists of multiple brokers, forming a Kafka cluster.
- Key Points:
  - Brokers manage topics, partitions, and message storage.
  - Each partition of a topic resides on one or more brokers, based on the replication factor.
  - One broker in the cluster is elected as the Controller to manage metadata and broker coordination.
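To make the replication factor concrete, here is a sketch that creates a topic whose partitions are each replicated on three brokers, assuming the kafka-python admin client and a cluster of at least three brokers (the topic name `orders` is illustrative):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# 3 partitions, each stored on 3 brokers; for every partition the
# controller designates one replica as leader and the rest as followers.
admin.create_topics([
    NewTopic(name="orders", num_partitions=3, replication_factor=3)
])
```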
Producers
- Producers are clients that publish messages to Kafka topics.
- How Producers Interact with Brokers:
  - Producers connect to one or more brokers (usually via a load balancer or bootstrap server addresses).
  - The producer sends messages to a specific topic.
- Partition Assignment:
  - A producer decides which partition a message goes to within the topic.
  - Default Behavior:
    - If a key is provided with the message, Kafka uses it to determine the partition using a hashing algorithm.
    - If no key is provided, Kafka assigns partitions in a round-robin manner.
- Example:
  - Topic `events` has 3 partitions: `events-0`, `events-1`, `events-2`.
  - A producer sends a message with key `user123`. Kafka hashes the key to assign the message to a specific partition (e.g., `events-1`).
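A sketch of the keyed case from this example, assuming the kafka-python client, whose default partitioner hashes the key so that repeated keys land on the same partition:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Messages sharing the key "user123" hash to the same partition,
# so they stay in order relative to each other.
producer.send("events", key=b"user123", value=b'{"action": "login"}')
producer.send("events", key=b"user123", value=b'{"action": "click"}')

# Without a key, messages are spread across partitions instead.
producer.send("events", value=b'{"action": "anonymous-visit"}')
producer.flush()
```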
Consumers
- Consumers are clients that read messages from Kafka topics.
- How Consumers Interact with Brokers:
  - Consumers subscribe to one or more topics and consume messages from partitions.
  - Kafka ensures that messages are delivered in order within a partition.
  - Consumers in a consumer group share the workload:
    - Each consumer in the group is assigned one or more partitions.
- Example:
  - Topic `logs` has 3 partitions.
  - Consumer group `group1` has 2 consumers: Consumer A and Consumer B.
  - Consumer A reads from `logs-0` and `logs-1`, while Consumer B reads from `logs-2`.
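A sketch of one group member, assuming the kafka-python client; running this script twice gives the two consumers from the example, and Kafka divides the three partitions of `logs` between them:

```python
from kafka import KafkaConsumer

# One member of consumer group "group1". Starting a second instance
# triggers a rebalance that splits the partitions between the two.
consumer = KafkaConsumer(
    "logs",
    bootstrap_servers="localhost:9092",
    group_id="group1",
    auto_offset_reset="earliest",
)

for message in consumer:
    print(f"partition={message.partition} offset={message.offset} value={message.value}")
```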
Broker Coordination
- Kafka uses a leader-follower model for partition replication:
  - For each partition, one broker acts as the leader and handles all read and write requests.
  - Other brokers store replica data and act as followers.
- Producers and consumers always interact with the leader broker for a partition.
- Failover:
  - If a broker fails, Kafka elects a new leader for the affected partitions from the available replicas.
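On the producer side, durability across such a failover is usually controlled with the acks setting; a sketch with the kafka-python client (parameter values are illustrative):

```python
from kafka import KafkaProducer

# acks="all": the leader waits for the in-sync replicas to confirm the
# write, so an acknowledged message survives a subsequent leader failover.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",
    retries=5,  # retry sends through transient leader elections
)
producer.send("events", b"durable message")
producer.flush()
```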
Interconnections in Action
- Producers to Brokers:
  - A producer queries the cluster metadata to determine which broker is the leader for a partition.
  - The producer sends messages directly to the leader broker for that partition.
- Brokers to Consumers:
  - Consumers fetch metadata from the cluster to discover which partitions belong to their subscribed topic(s).
  - Consumers connect directly to the leader broker(s) for those partitions to fetch data.
- Brokers to Each Other:
  - Brokers exchange metadata (e.g., partition assignments and replicas) to keep the cluster consistent.
  - Replication occurs between brokers to ensure fault tolerance.
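The metadata step is observable from a client; a small sketch with the kafka-python client, reusing the `events` topic from the earlier example:

```python
from kafka import KafkaConsumer

# The client asks any bootstrap broker for cluster metadata; actual
# fetches then go directly to each partition's leader broker.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.partitions_for_topic("events"))  # e.g., {0, 1, 2}
```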
Data Flow Example
- Producer sends a message:
  - Producer connects to the cluster and queries metadata.
  - Metadata response identifies the leader broker for the target partition.
  - Producer sends the message to the leader broker.
- Broker stores the message:
  - The leader broker appends the message to the partition's log.
  - Replication: the leader propagates the message to follower brokers.
- Consumer reads the message:
  - Consumer fetches metadata to identify the leader broker for the assigned partition.
  - Consumer connects to the leader broker and fetches messages.
  - Kafka tracks the offset of each message to ensure no data is skipped or duplicated.
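Offset tracking is visible on the consumer side; a sketch with the kafka-python client in which offsets are committed only after processing (the processing step is a placeholder):

```python
from kafka import KafkaConsumer

def process(value: bytes) -> None:
    # Placeholder for real processing logic.
    print(value)

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="readers",
    enable_auto_commit=False,  # commit offsets explicitly, not on a timer
)

for message in consumer:
    process(message.value)
    consumer.commit()  # record the offset so the message is not re-delivered
```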
Key Advantages of These Interconnections
- Scalability:
  - Adding brokers increases the cluster's capacity.
  - Producers and consumers automatically adjust to the new cluster size.
- Fault Tolerance:
  - Replication ensures data availability even if a broker fails.
  - Consumers can continue reading from replicas if a leader broker goes down.
- High Throughput:
  - Partitioning enables parallel processing.
  - Producers and consumers can write and read from multiple brokers simultaneously.
```
Producers --> [Broker1 (Leader)] <-- Consumers
              [Broker2 (Follower)]
              [Broker3 (Follower)]
```
In this example:
- The producer sends data to `Broker1`, the leader of a partition.
- `Broker2` and `Broker3` replicate the data from `Broker1`.
- Consumers fetch data from `Broker1` (or from one of the followers, if configured).
Kafka’s Storage Model
- Log-Based Storage:
  - Kafka organizes data into topics, and each topic is divided into partitions.
  - Each partition is an append-only log stored on disk.
  - Messages are written sequentially to the end of the log, making writes extremely fast due to minimal disk seek overhead.
- Retention-Based Storage:
  - Kafka retains messages for a configurable retention period (e.g., 7 days by default) or until the log reaches a certain size.
  - Old messages are automatically deleted after they exceed the retention policy.
  - Kafka is not designed for long-term data storage like HDFS; it is optimized for real-time streaming and temporary storage.
- File Organization:
  - Each partition is stored as a set of files on the broker's disk.
  - These files are segmented for efficient access (e.g., log-0, log-1).
  - Kafka indexes these log files to enable fast lookups by message offset.
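A sketch of configuring retention at topic-creation time, assuming the kafka-python admin client; `retention.ms` and `retention.bytes` are standard topic-level configs, while the topic name and values are illustrative:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Delete messages after 7 days, or once a partition's log exceeds
# ~1 GiB, whichever limit is reached first.
admin.create_topics([
    NewTopic(
        name="clickstream",
        num_partitions=3,
        replication_factor=2,
        topic_configs={
            "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # 7 days
            "retention.bytes": str(1024 ** 3),             # ~1 GiB per partition
        },
    )
])
```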
Kafka vs. HDFS Storage
| Feature | Kafka | HDFS |
|---|---|---|
| Purpose | Real-time streaming, message queue | Long-term distributed file storage |
| Data Retention | Configurable (e.g., time/size-based) | Permanent until explicitly deleted |
| Read/Write Pattern | Append-only log, sequential writes | Random access, distributed writes |
| Scalability | Horizontal scaling via brokers | Horizontal scaling via DataNodes |
| Fault Tolerance | Replication at partition level | Replication across DataNodes |
| Primary Use Case | Real-time data pipelines | Storing large datasets for analysis |
How Kafka Handles Storage
- Persistence:
  - Kafka stores all data on disk, even though it's a messaging system.
  - This ensures durability and allows consumers to replay messages if needed.
- Replication:
  - Messages in Kafka are replicated across brokers for fault tolerance.
  - For example, a topic with a replication factor of 3 will have its data stored on three brokers.
- Segmented Storage:
  - Kafka splits each partition's log into segments.
  - When a segment reaches a configured size, Kafka creates a new segment.
  - Old segments are deleted according to the retention policy.
- Compaction (Optional):
  - Kafka offers a log compaction feature, which retains only the latest value for each key.
  - This is useful for scenarios like updating state or maintaining a compact view of data.
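A sketch of enabling compaction at topic creation, again with the kafka-python admin client; `cleanup.policy=compact` is the standard topic config, and the topic name is illustrative:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# A compacted topic keeps only the newest message per key, acting as a
# durable, replayable snapshot of the latest state for every key.
admin.create_topics([
    NewTopic(
        name="user-profiles",
        num_partitions=3,
        replication_factor=3,
        topic_configs={"cleanup.policy": "compact"},
    )
])
```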
Kafka does have its own storage, but it is designed for short-term storage and high-throughput message delivery. For long-term storage or big data analysis, Kafka is usually paired with systems like HDFS, S3, or other data lakes.
Link to the original article: https://habr.com/ru/articles/872976/