Introduction to Apache Kafka


Apache Kafka is a general-purpose publish/subscribe messaging system. It can also be described as a distributed streaming platform. Unlike many other messaging systems, Kafka is designed as a distributed system that runs as a cluster. Kafka is also a durable storage system: it retains data for as long as it is configured to.

Messages

The unit of data within Kafka is called a message. From Kafka's perspective, a message is simply an array of bytes. A message can optionally carry a piece of metadata called a key, which is also just an array of bytes.
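
Because Kafka only sees bytes, turning application data into a message is the client's job. The snippet below is a minimal sketch of that idea, not the Kafka client API: the `Message` class and `encode_message` helper are hypothetical names, and JSON is just one possible encoding.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for a Kafka message: both fields are raw bytes."""
    value: bytes
    key: Optional[bytes] = None  # the key is optional metadata


def encode_message(payload: dict, key: Optional[str] = None) -> Message:
    # Serialization happens on the client side; Kafka itself only handles bytes.
    value = json.dumps(payload, sort_keys=True).encode("utf-8")
    encoded_key = key.encode("utf-8") if key is not None else None
    return Message(value=value, key=encoded_key)


msg = encode_message({"user": "alice", "action": "login"}, key="alice")
print(msg.key, msg.value)
```

A message without a key is equally valid; `encode_message({"ping": 1})` simply leaves `key` as `None`.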

Batches

For efficiency, messages are written in batches. A batch is a collection of messages produced for the same topic and written to the same partition. Batching is a trade-off between throughput and latency: larger batches let Kafka handle more messages per unit of time, but an individual message takes longer to propagate. Batches can also be compressed, which saves network bandwidth at the cost of some computing power.
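
To make the grouping rule concrete, here is a small sketch that collects records into one batch per topic and partition and then compresses a batch as a single unit. The function names and the length-prefixed framing are assumptions made for illustration, and `zlib` stands in for whichever codec a real producer would be configured with.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

# (topic, partition) identifies the destination of a batch.
TopicPartition = Tuple[str, int]


def build_batches(records: List[Tuple[str, int, bytes]]) -> Dict[TopicPartition, List[bytes]]:
    """Group records so that each batch holds messages for one topic and partition."""
    batches: Dict[TopicPartition, List[bytes]] = defaultdict(list)
    for topic, partition, value in records:
        batches[(topic, partition)].append(value)
    return dict(batches)


def compress_batch(messages: List[bytes]) -> bytes:
    """Length-prefix each message, then compress the whole batch at once."""
    framed = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return zlib.compress(framed)


def decompress_batch(blob: bytes) -> List[bytes]:
    """Reverse compress_batch: decompress, then split on the length prefixes."""
    framed = zlib.decompress(blob)
    messages, pos = [], 0
    while pos < len(framed):
        size = int.from_bytes(framed[pos:pos + 4], "big")
        messages.append(framed[pos + 4:pos + 4 + size])
        pos += 4 + size
    return messages


records = [
    ("clicks", 0, b'{"page": "/home"}'),
    ("clicks", 1, b'{"page": "/cart"}'),
    ("clicks", 0, b'{"page": "/home"}'),
]
batches = build_batches(records)
blob = compress_batch(batches[("clicks", 0)])
```

Compressing the batch as a whole, rather than message by message, lets the codec exploit repetition across similar messages, which is where most of the saving tends to come from.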

Schemas

While messages are opaque byte arrays to Kafka, it is recommended to define a schema or structure for the message content, for example with JSON or XML. Apache Avro is a popular serialization format for record data and a common choice for streaming data pipelines. It offers strong support for schema evolution and has implementations for many languages and environments, such as the JVM, Python, and C/C++. One benefit of using Avro is that it supports typed data.
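
The sketch below illustrates what typed fields and schema evolution buy you. It is deliberately not Avro: the schema format (a dict of field name to type and default), the `REQUIRED` marker, and both functions are hypothetical, and JSON is used as the wire format only to keep the example dependency-free.

```python
import json
from typing import Any, Dict, Tuple

REQUIRED = object()  # marker for fields that have no default

# Hypothetical schema format: field name -> (Python type, default or REQUIRED).
Schema = Dict[str, Tuple[type, Any]]

SCHEMA_V1: Schema = {"user_id": (int, REQUIRED), "email": (str, REQUIRED)}
# Version 2 adds a field with a default, so data written with v1 stays readable.
SCHEMA_V2: Schema = {**SCHEMA_V1, "country": (str, "unknown")}


def serialize(record: Dict[str, Any], schema: Schema) -> bytes:
    """Check the record against the schema, then encode it to bytes."""
    for name, (field_type, default) in schema.items():
        if name not in record:
            if default is REQUIRED:
                raise KeyError(f"missing required field: {name}")
            continue
        if not isinstance(record[name], field_type):
            raise TypeError(f"{name} must be of type {field_type.__name__}")
    return json.dumps(record).encode("utf-8")


def deserialize(data: bytes, schema: Schema) -> Dict[str, Any]:
    """Decode bytes and fill in defaults for fields the writer did not know about."""
    raw = json.loads(data.decode("utf-8"))
    record: Dict[str, Any] = {}
    for name, (field_type, default) in schema.items():
        if name in raw:
            value = raw[name]
        elif default is not REQUIRED:
            value = default
        else:
            raise KeyError(f"missing required field: {name}")
        if not isinstance(value, field_type):
            raise TypeError(f"{name} must be of type {field_type.__name__}")
        record[name] = value
    return record


old_bytes = serialize({"user_id": 7, "email": "a@example.com"}, SCHEMA_V1)
upgraded = deserialize(old_bytes, SCHEMA_V2)  # reader uses the newer schema
```

Here a consumer that has already moved to `SCHEMA_V2` can still read messages written under `SCHEMA_V1`, which is the kind of compatibility a schema-aware format is meant to provide.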

Topics and Partitions

Messages are categorized into topics, and topics are further broken down into partitions. Messages are written to a partition in an append-only fashion and are read in order from beginning to end. Ordering is guaranteed only within a single partition of a topic, not across the topic as a whole. Partitions are the mechanism through which Kafka provides redundancy and scalability: each partition can be stored on a different server, so a single topic can scale horizontally across multiple servers.
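
A toy in-memory model can make the ordering guarantee easier to see. The `Partition` and `Topic` classes below are hypothetical and only mimic the append-only behaviour described above; they are not how Kafka stores data.

```python
from typing import List, Tuple


class Partition:
    """Append-only log; a message's offset is its position in the log."""

    def __init__(self) -> None:
        self._log: List[bytes] = []

    def append(self, value: bytes) -> int:
        self._log.append(value)
        return len(self._log) - 1  # offset assigned to the new message

    def read(self, offset: int, max_messages: int = 10) -> List[Tuple[int, bytes]]:
        # Reads always move forward through the log, in offset order.
        chunk = self._log[offset:offset + max_messages]
        return [(offset + i, value) for i, value in enumerate(chunk)]


class Topic:
    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]


topic = Topic("orders", num_partitions=2)
topic.partitions[0].append(b"order-1")
topic.partitions[1].append(b"order-2")
topic.partitions[0].append(b"order-3")
```

Reading partition 0 yields `order-1` then `order-3` in that order, but nothing in the model records whether `order-2` in partition 1 was written before or after `order-3`. That is the sense in which ordering holds per partition only.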


Producers and Consumers

Producers

Producers create new messages. In publish/subscribe terms they are called publishers. A message is produced to a specific topic. By default, the producer does not care which partition the message is written to, and messages are balanced across all partitions of the topic. When a producer needs to direct messages to a particular partition, it typically does so through the message key: the key is hashed to choose the partition, so all messages with the same key are written to the same partition.
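
The following sketch shows one way such a partitioning rule could look. The `Partitioner` class is hypothetical, CRC32 is used as a stand-in hash purely because it ships with Python, and the round-robin handling of keyless messages is an assumption for illustration rather than a description of any particular client version.

```python
import itertools
import zlib
from typing import Optional


class Partitioner:
    """Hypothetical partitioner: hash keyed messages, rotate keyless ones."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread messages evenly across partitions.
            return next(self._round_robin)
        # Same key -> same hash -> same partition (stand-in hash, see lead-in).
        return zlib.crc32(key) % self.num_partitions


partitioner = Partitioner(num_partitions=3)
first = partitioner.partition_for(b"customer-42")
second = partitioner.partition_for(b"customer-42")
print(first == second)
```

Note that the mapping depends on the number of partitions: if partitions are added to a topic later, the same key may start landing in a different partition.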

Consumers

In publish/subscribe terms, consumers are called subscribers. A consumer subscribes to one or more topics and reads messages in the order in which they were produced within each partition. The consumer keeps track of which messages it has already read by means of the offset, another piece of metadata. Each message in a partition has a unique offset. By storing its read position for each partition, either in ZooKeeper or in Kafka itself, a consumer can stop and restart without losing its place.
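
Offset tracking can be sketched in a few lines. Everything here is hypothetical: the `committed_offsets` dict stands in for wherever offsets are actually stored, the partition is a plain list, and this sketch chooses to store the offset of the next message to read rather than the last one read.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the offset store, keyed by (topic, partition).
committed_offsets: Dict[Tuple[str, int], int] = {}


class Consumer:
    def __init__(self, topic: str, partition: int, log: List[bytes]) -> None:
        self.topic, self.partition, self.log = topic, partition, log
        # Resume from the committed position, or from the start of the log.
        self.position = committed_offsets.get((topic, partition), 0)

    def poll(self, max_messages: int = 2) -> List[Tuple[int, bytes]]:
        chunk = self.log[self.position:self.position + max_messages]
        records = [(self.position + i, value) for i, value in enumerate(chunk)]
        self.position += len(chunk)
        return records

    def commit(self) -> None:
        # Assumption of this sketch: store the offset of the next message to read.
        committed_offsets[(self.topic, self.partition)] = self.position


log = [b"m0", b"m1", b"m2"]
first = Consumer("events", 0, log)
batch = first.poll()                    # reads offsets 0 and 1
first.commit()

restarted = Consumer("events", 0, log)  # simulates a restart of the consumer
rest = restarted.poll()                 # continues at offset 2
```

If the first consumer had crashed before calling `commit()`, the restarted one would begin again at offset 0 and see `m0` and `m1` a second time, which is why the timing of commits matters.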

Consumer Groups

One or more consumers can work together to consume a topic; together they form a consumer group. The group ensures that each partition is consumed by only one member of the group, although a single consumer can consume from multiple partitions. In this way consumers can scale horizontally. If a consumer fails, the remaining members of the group rebalance the partitions among themselves to take over from the missing consumer.
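
The assignment rule and the effect of losing a consumer can be sketched as follows. The `assign_partitions` function is a hypothetical round-robin strategy chosen for simplicity; it is not claimed to match any specific assignor shipped with Kafka.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin assignment: each partition goes to exactly one consumer."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    ordered = sorted(consumers)
    assignment: Dict[str, List[int]] = {name: [] for name in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Three consumers share four partitions.
before = assign_partitions(partitions, ["c1", "c2", "c3"])

# Consumer c2 fails; the group recomputes the assignment without it.
after = assign_partitions(partitions, ["c1", "c3"])

print(before)
print(after)
```

A consequence of the one-consumer-per-partition rule is that adding more consumers than there are partitions does not help: with this function, the extra consumers would simply receive an empty list.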


Brokers

A single Kafka server is called a broker. A broker receives messages from producers, assigns offsets to them, and commits them to storage on disk. A broker also serves consumers by responding to fetch requests for partitions. Depending on the hardware, a single broker can handle thousands of partitions and millions of messages per second.
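
The two broker duties described above, accepting writes and serving fetches, can be sketched with a file per partition. The `Broker` class, its method names, the file naming, and the in-memory index are all assumptions made for this illustration and do not reflect Kafka's actual storage format or protocol.

```python
import os
import tempfile
from typing import Dict, List, Tuple


class Broker:
    """Toy broker: one log file per partition plus an in-memory offset index."""

    def __init__(self, data_dir: str) -> None:
        self.data_dir = data_dir
        # (topic, partition) -> list of (byte position, length), indexed by offset
        self._index: Dict[Tuple[str, int], List[Tuple[int, int]]] = {}

    def _path(self, topic: str, partition: int) -> str:
        return os.path.join(self.data_dir, f"{topic}-{partition}.log")

    def produce(self, topic: str, partition: int, value: bytes) -> int:
        entries = self._index.setdefault((topic, partition), [])
        position = entries[-1][0] + entries[-1][1] if entries else 0
        with open(self._path(topic, partition), "ab") as log_file:
            log_file.write(value)  # commit the message to disk
        entries.append((position, len(value)))
        return len(entries) - 1    # the offset assigned by the broker

    def fetch(self, topic: str, partition: int, offset: int,
              max_messages: int = 10) -> List[Tuple[int, bytes]]:
        entries = self._index.get((topic, partition), [])
        wanted = entries[offset:offset + max_messages]
        if not wanted:
            return []
        records = []
        with open(self._path(topic, partition), "rb") as log_file:
            for i, (position, length) in enumerate(wanted):
                log_file.seek(position)
                records.append((offset + i, log_file.read(length)))
        return records


with tempfile.TemporaryDirectory() as data_dir:
    broker = Broker(data_dir)
    offsets = [broker.produce("orders", 0, v) for v in (b"a", b"bb", b"ccc")]
    fetched = broker.fetch("orders", 0, offset=1)
```

The point of the sketch is the division of labour: producers never choose offsets, and consumers ask for data by partition and offset rather than having it pushed to them.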

Clusters

Kafka brokers are designed to operate as part of a cluster. Within a cluster, one broker also acts as the cluster controller. The controller is responsible for administrative operations such as assigning partitions to brokers and monitoring for broker failures.