Apache Kafka is a free and open-source distributed streaming platform. It began
as a message queue at LinkedIn but has since grown into a flexible tool for
working with data streams in a wide range of situations. Because Kafka is a
distributed system, it can scale up and down as needed: to add capacity, you
simply expand the cluster with new Kafka nodes (servers).
Kafka can process a large volume of data in a short time, and its low latency
allows for real-time data processing. Although Apache Kafka is written in
Scala and Java, it can be used from a wide range of programming languages.
Ques. 1): What exactly do you mean when you say
"confluent kafka"? What are the benefits?
Answer:
Confluent is a data streaming platform built on Apache Kafka that does more
than publish and subscribe: it can also store and process data within the
stream. Confluent Kafka is a more extensive distribution of Apache Kafka. It
improves Kafka's integration capabilities and adds tools for optimising and
maintaining Kafka clusters, as well as methods for securing the streams. The
Confluent Platform makes Kafka simple to set up and use. Confluent's software
is available in three flavours:
- A free, open-source streaming platform that makes working with real-time
data streams a breeze.
- A premium cloud-based version with more administration, operations, and
monitoring features.
- An enterprise-grade version with more administration, operations, and
monitoring tools.
The advantages of Confluent Kafka are as follows:
- It offers practically all of Kafka's features, as well as a few extras.
- It greatly simplifies administration and operations.
- It relieves data managers of the burden of thinking about how data is
relayed.
Ques. 2): What are some of Kafka's characteristics?
Answer:
The following are some of Kafka's most notable characteristics:
- Kafka is a fault-tolerant messaging system with high throughput.
- Kafka has a built-in partitioning system, organised around Topics.
- Kafka also comes with a replication mechanism.
- Kafka is a distributed messaging system that can manage massive volumes of
data and transfer messages from senders to receivers.
- Kafka can also persist messages to storage and replicate them across the
cluster.
- Kafka works with ZooKeeper for synchronisation and coordination with other
services.
- Kafka provides excellent support for Apache Spark.
Ques. 3): What are some of the real-world usages of Apache
Kafka?
Answer:
The following are some examples of Apache Kafka's real-world applications:
- Message broker: Because Apache Kafka has high throughput, it can handle
large volumes of similar types of messages or data. It can be used as a
publish-subscribe messaging system that makes it simple to publish and read
data.
- Website activity tracking: Apache Kafka can check whether data is
successfully sent and received by websites. It is capable of handling the huge
volumes of data that websites generate for each page and for user actions.
- Metrics monitoring: We can use Apache Kafka to monitor operational data and
keep track of metrics connected to certain technologies, such as security
logs.
- Data logging: Apache Kafka replicates data between nodes, which can be used
to restore data on failed nodes. It can also collect data from various logs
and make it available to consumers.
- Stream processing: Apache Kafka can also handle streaming data, that is,
data that is read from one topic, processed, and then written to another. The
new topic containing the processed data is then available to users and
applications.
Ques. 4): What are some of Kafka's disadvantages?
Answer:
The following are some of Kafka's drawbacks:
- Kafka's performance suffers when messages have to be modified in flight.
Kafka works best when messages do not need to be changed.
- Kafka does not support wildcard topic selection. The exact topic name must
be used.
- With large messages, brokers and consumers spend time compressing and
decompressing them, which reduces Kafka's performance and throughput.
- Kafka does not support some message paradigms, such as point-to-point queues
and request/reply.
- Kafka lacks a comprehensive set of monitoring tools.
Ques. 5): What are the use cases of Kafka monitoring?
Answer:
The following are some examples of Kafka monitoring use cases:
- Monitor system resource usage: Track the usage of system resources such as
memory, CPU, and disk over time.
- Monitor threads and JVM usage: Kafka relies on the Java garbage collector to
free up memory, and the collector runs more frequently the more active the
Kafka cluster is.
- Keep an eye on broker, controller, and replication statistics so that
partition and replica states can be adjusted as needed.
- Identify which applications are generating excessive demand: finding
performance bottlenecks in this way helps resolve performance issues quickly.
Ques. 6): What is the difference between Kafka and Flume?
Answer:
Flume's main use case is ingesting data into Hadoop. It integrates with
Hadoop's monitoring system, file formats, and file system, and with tools like
Morphlines. Flume is the best option when working with non-relational data
sources or streaming a huge file into Hadoop.
Kafka's main use case is as a distributed publish-subscribe messaging system.
Kafka was not created with Hadoop in mind, so using it to gather and analyse
data for Hadoop is significantly more difficult than using Flume.
Kafka is the better choice when a highly reliable and scalable enterprise
messaging system is needed to connect multiple systems, Hadoop among them.
Ques. 7): Explain the terms "leader" and
"follower."
Answer:
In Kafka, each partition has one server that acts as the Leader and one or
more servers that act as Followers. The Leader handles all read and write
requests for the partition, while the Followers passively replicate the
Leader. If the Leader fails, one of the Followers automatically takes over as
the new Leader. Because each server acts as Leader for some partitions and as
Follower for others, load is balanced across the cluster.
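The failover behaviour described above can be pictured with a small toy model. This is only a sketch in plain Python, not Kafka's actual election logic: the `Partition` class, the broker names, and the "promote the first follower" rule are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition: a leader plus its followers."""
    leader: str
    followers: list[str] = field(default_factory=list)
    log: list[str] = field(default_factory=list)

    def write(self, record: str) -> None:
        # All writes go through the leader; followers copy its log.
        self.log.append(record)

    def fail_leader(self) -> str:
        # Hypothetical rule: promote the first follower when the leader is lost.
        if not self.followers:
            raise RuntimeError("no follower available to take over")
        self.leader = self.followers.pop(0)
        return self.leader


partition = Partition(leader="broker-1", followers=["broker-2", "broker-3"])
partition.write("order-created")
new_leader = partition.fail_leader()
print(new_leader, partition.followers, partition.log)
# prints: broker-2 ['broker-3'] ['order-created']
```

The replicated log survives the change of leader, which is the point of keeping followers in the first place.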
Ques. 8): What are the traditional methods of message
transfer? How is Kafka better than them?
Answer:
The classic techniques of message transfer are as follows:
Message Queuing:
The message queuing pattern employs a point-to-point approach. A message in
the queue is discarded once it has been consumed, similar to how a message in
the Post Office Protocol is removed from the server once it has been
delivered. These queues allow for asynchronous messaging.
If a message cannot be delivered, for example because of a network problem or
because the consumer is unavailable, it stays in the queue until it can be
transmitted. As a result, messages aren't always delivered in the order in
which they were sent. Instead, they are distributed on a first-come,
first-served basis, which in some cases can improve efficiency.
Publisher-Subscriber Model:
The publish-subscribe pattern entails publishers producing ("publishing")
messages in multiple categories and subscribers consuming published messages
from the various categories to which they are subscribed. Unlike
point-to-point messaging, a message is only removed once it has been consumed
by all subscribers to its category.
Kafka offers a single consumer abstraction, the consumer group, which
generalises both of the models above. The advantages of adopting Kafka over
traditional message transfer mechanisms are as follows:
Scalable: Data is partitioned and spread across a cluster of machines, which
increases storage capacity.
Faster: A single Kafka broker can handle megabytes of reads and writes per
second, allowing it to serve thousands of clients.
Durable and Fault-Tolerant: Data is persisted and replicated across the
cluster, so it survives hardware failures.
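To see how a consumer group can cover both the queue and the publish-subscribe models, here is a minimal sketch in plain Python. It is a toy delivery model, not the Kafka client: the `deliver` function, the group names, and the round-robin rule (standing in for Kafka's partition assignment) are assumptions for illustration.

```python
from collections import defaultdict


def deliver(messages, groups):
    """Toy delivery model.

    Every group sees every message (publish-subscribe), but inside a group
    each message goes to exactly one member (queue). `groups` maps a group
    name to its list of consumer names.
    """
    received = defaultdict(list)
    for index, message in enumerate(messages):
        for group, members in groups.items():
            # Round-robin by message index picks one member per group.
            consumer = members[index % len(members)]
            received[(group, consumer)].append(message)
    return dict(received)


groups = {"billing": ["b1", "b2"], "audit": ["a1"]}
result = deliver(["m0", "m1", "m2", "m3"], groups)
for key in sorted(result):
    print(key, result[key])
```

With one group holding all consumers the model behaves like a queue; with one consumer per group it behaves like publish-subscribe.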
Ques. 9): What is a Replication Tool in Kafka? Explain how
to use some of Kafka's replication tools.
Answer:
Kafka's replication tools are used to manage replicas at a high level. Some of
the replication tools available are as follows:
Preferred Replica Leader Election tool: Kafka distributes the partitions of a
topic across the brokers in a cluster, and each copy of a partition is known
as a replica. The preferred replica is the replica that should act as leader
under normal conditions. The brokers generally distribute leadership for the
various partitions fairly across the cluster, but failures, planned shutdowns,
and other circumstances can cause an imbalance to develop over time. By
re-electing the preferred replicas as leaders, this tool restores the balance.
Topics tool: The Kafka topics tool is in charge of all administration
operations relating to topics, including:
- Listing and describing topics.
- Creating topics.
- Modifying topics.
- Adding partitions to a topic.
- Deleting topics.
Reassign Partitions tool: Changes the replicas assigned to a partition, that
is, adds followers to or removes followers from a partition.
StateChangeLogMerger tool: Collects data from the brokers in a cluster,
formats it into a central log, and aids in troubleshooting state change
issues. Sometimes there are problems with the election of a leader for a
particular partition, and this tool can be used to figure out what is causing
them.
Change Topic Configuration tool: Used to add new configuration options, modify
existing configuration options, and delete configuration options.
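The idea behind preferred replica leader election can be illustrated with a short sketch. It is a simplified model in plain Python and does not call any Kafka tool: the `elect_preferred` function and the broker ids are hypothetical, and the only Kafka convention relied on is that the first replica in a partition's assignment list is its preferred replica.

```python
from collections import Counter


def elect_preferred(assignments, current_leaders, live_brokers):
    """Hand leadership back to each partition's preferred replica.

    The first broker in each replica list is treated as the preferred
    replica. If that broker is not alive, the current leader is kept.
    """
    new_leaders = {}
    for partition, replicas in assignments.items():
        preferred = replicas[0]
        if preferred in live_brokers:
            new_leaders[partition] = preferred
        else:
            new_leaders[partition] = current_leaders[partition]
    return new_leaders


# partition -> ordered replica list (broker ids are made up)
assignments = {0: [1, 2], 1: [2, 3], 2: [3, 1]}
# Broker 1 went down and came back, leaving broker 2 leading two partitions.
skewed = {0: 2, 1: 2, 2: 3}
balanced = elect_preferred(assignments, skewed, live_brokers={1, 2, 3})
print(Counter(skewed.values()), Counter(balanced.values()))
```

After the election each of the three brokers leads exactly one partition again.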
Ques. 10): Explain the four core APIs that Kafka uses.
Answer:
The following are the four core APIs that Kafka uses:
Producer API:
The Producer API in Kafka allows an application to publish a stream of records
to one or more Kafka topics.
Consumer API:
The Kafka Consumer API allows an application to subscribe to one or more Kafka
topics and to process the streams of records published to them.
Streams API:
The Kafka Streams API allows an application to process data in Kafka using a
stream processing architecture. The application takes input streams from one
or more topics, processes them with stream operations, and produces output
streams that are sent to one or more topics. In this way, the Streams API lets
you turn input streams into output streams.
Connect API:
The Kafka Connect API connects Kafka topics to applications. It allows
building and running reusable producers and consumers that link Kafka topics
to existing applications or data systems. A connector, for example, may
capture all database updates and ensure that they are made available in a
Kafka topic.
Ques. 11): Is it possible to utilise Kafka without
Zookeeper?
Answer:
Yes. As of version 2.8, Kafka can be run without ZooKeeper. Kafka 2.8.0,
released in April 2021, gave everyone the opportunity to try it out without
ZooKeeper. In that release, however, the ZooKeeper-less mode was not yet ready
for production and was missing a few crucial features.
In prior versions it was not feasible to bypass ZooKeeper and connect directly
to the Kafka broker, because Kafka cannot serve client requests when ZooKeeper
is down.
Ques. 12): Explain Kafka's concept of leader and follower.
Answer:
Each partition in Kafka has one server acting as the Leader and one or more
servers acting as Followers. The Leader handles the partition's read and write
requests, while the Followers passively replicate the Leader. If the Leader
fails, one of the Followers takes over. Because each server leads some
partitions and follows others, load is balanced across the cluster.
Ques. 13): In Kafka, what is the function of partitions?
Answer:
From the standpoint of the Kafka broker, partitions allow a single topic to be
split across many servers. This lets a single topic hold more data than a
single server can store. If you have three brokers and need to store 10 TB of
data in a topic, one option is to create a topic with only one partition and
store the entire 10 TB on one broker. Another option is to create a topic with
three partitions and spread the 10 TB across all three brokers. From the
consumer's perspective, a partition is a unit of parallelism.
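The 10 TB example above can be worked through in a few lines. This is a rough sketch in plain Python: the helper names are made up, the even spread is an idealisation, and CRC32 is only a stand-in hash (Kafka's own default partitioner hashes keys with murmur2).

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index.

    CRC32 is a stand-in here; Kafka's default partitioner uses murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def per_partition_tb(total_tb: float, num_partitions: int) -> float:
    """Data held by each partition if records spread evenly (an idealisation)."""
    return total_tb / num_partitions


# The same key always lands in the same partition.
for key in ["user-1", "user-2", "user-3"]:
    print(key, "->", choose_partition(key, 3))

print(per_partition_tb(10, 1), "TB on one broker with a single partition")
print(round(per_partition_tb(10, 3), 2), "TB per broker with three partitions")
```

Three partitions also mean that up to three consumers in one group can read the topic in parallel, which is the consumer-side benefit mentioned above.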
Ques. 14): In Kafka, what do you mean by geo-replication?
Answer:
Geo-replication is a feature in Kafka that lets you copy messages from one
cluster to a number of other data centres or cloud regions. You can use
geo-replication to replicate all of your data and store it around the world if
necessary. Geo-replication can be achieved with Kafka's MirrorMaker tool, and
it provides a reliable way to keep data backed up.
Ques. 15): Is Apache Kafka a platform for distributed
streaming? What are you going to do with it?
Answer:
Yes. Apache Kafka is a distributed streaming platform. A streaming platform
includes three critical capabilities:
- It lets us publish records easily through a distributed streaming
infrastructure.
- It has a large storage capacity, so we can store a large number of records
without difficulty.
- It helps us process records as they arrive.
Kafka allows us to do the following:
- Build real-time streaming data pipelines that move data between two systems.
- Build real-time streaming applications that react to data.
Ques. 16): What is Apache Kafka Cluster used for?
Answer:
An Apache Kafka cluster is a messaging system used to overcome the challenges
of gathering and processing enormous amounts of data. Its most important
advantages are as follows:
- We can track web activity by storing and sending events for real-time
processing.
- We can use it to both alert and report on operational metrics.
- We can use it to transform data into a common format.
- It enables continuous, real-time processing of the data streaming into its
topics.
Thanks to these characteristics, it is often preferred over popular
alternatives such as ActiveMQ, RabbitMQ, AWS, and others.
Ques. 17): What is the purpose of the Streams API?
Answer:
The Streams API allows an application to act as a stream processor: it
consumes an input stream from one or more topics and produces an output stream
to one or more output topics, effectively transforming the input streams into
output streams.
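The read-transform-write pattern described above can be sketched without any Kafka dependency. The example below is an illustration in plain Python, not the Kafka Streams API (which is a Java library): the `process_stream` function, the topic names, and the use of a dict of lists in place of a cluster are all assumptions.

```python
def process_stream(topics, input_topic, output_topic, transform):
    """Read each record from one topic, transform it, append it to another.

    `topics` is a plain dict of lists standing in for a Kafka cluster.
    """
    output = topics.setdefault(output_topic, [])
    for record in topics.get(input_topic, []):
        output.append(transform(record))
    return output


topics = {"raw-clicks": ["Home", "CART", "checkout"]}
process_stream(topics, "raw-clicks", "clean-clicks", str.lower)
print(topics["clean-clicks"])
# prints: ['home', 'cart', 'checkout']
```

The input topic is left untouched, so other consumers can still read the original records.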
Ques. 18): In Kafka, what do you mean by graceful shutdown?
Answer:
The Kafka cluster automatically detects any broker shutdown or failure. When
that happens, new leaders are elected for the partitions previously led by
that broker. This can occur as a result of a server failure or when the server
is shut down deliberately for maintenance or configuration changes. For
deliberate shutdowns, Kafka provides a graceful way of stopping a server
rather than killing it.
When a server is shut down gracefully, the following happens:
- Kafka syncs all of its logs to disk so that no log recovery is needed when
it restarts. Because log recovery takes time, this speeds up intentional
restarts.
- Before shutting down, the server migrates leadership of every partition it
leads to other replicas. This makes the leadership transfer faster and reduces
the time each partition is unavailable to a few milliseconds.
Ques. 19): In Kafka, what do the terms
BufferExhaustedException and OutOfMemoryException mean?
Answer:
A BufferExhaustedException is thrown when the producer cannot allocate memory
for a record because the buffer is full. If the producer is in non-blocking
mode and the rate of production exceeds, over an extended period, the rate at
which data is sent from the buffer, the allocated buffer will be exhausted and
the exception will be thrown.
An OutOfMemoryException may occur if consumers are handling very large
messages or if the number of messages sent grows faster than the rate of
downstream processing. As a result, the message queue fills up and consumes
the available RAM.
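The buffer behaviour described above can be reproduced with a small simulation. This is a toy model in plain Python, not the Kafka producer: the `ProducerBuffer` class, the `BufferExhausted` exception, and the byte counts are made up to show what happens when records are produced faster than they are drained in non-blocking mode.

```python
class BufferExhausted(Exception):
    """Stand-in for the producer's BufferExhaustedException."""


class ProducerBuffer:
    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0

    def append(self, record: bytes) -> None:
        # Non-blocking mode: fail immediately instead of waiting for space.
        if self.used + len(record) > self.capacity:
            raise BufferExhausted("buffer full, record rejected")
        self.used += len(record)

    def drain(self, num_bytes: int) -> None:
        # Models data leaving the buffer for the broker, which frees space.
        self.used = max(0, self.used - num_bytes)


buffer = ProducerBuffer(capacity_bytes=10)
sent, rejected = 0, 0
for _ in range(5):
    try:
        buffer.append(b"abcd")  # 4 bytes produced per step
        sent += 1
    except BufferExhausted:
        rejected += 1
    buffer.drain(2)  # only 2 bytes leave per step
print(sent, rejected)
# prints: 4 1
```

Because production outpaces draining by 2 bytes per step, the buffer fills up and the fifth record is rejected.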
Ques. 20): How will you change the retention time in Kafka
at runtime?
Answer:
A topic's retention time can be configured in Kafka. The default retention
time is seven days. We can set the retention time while creating a new topic;
if we don't, the broker property log.retention.hours determines it. To modify
the configuration of an existing topic, use one of Kafka's command-line
scripts.
The right command depends on the Kafka version in use:
- Up to version 0.8.2, use kafka-topics.sh --alter.
- Starting with version 0.9.0, use kafka-configs.sh --alter.
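As a rough illustration, the sketch below converts a retention period into the millisecond value used by the topic-level retention.ms setting and assembles a kafka-configs.sh command line without running it. The topic name "orders" and the address localhost:9092 are made-up examples, and the --bootstrap-server form is an assumption that fits recent Kafka releases; older releases pointed the tool at ZooKeeper with --zookeeper instead.

```python
from __future__ import annotations


def retention_ms(hours: float) -> int:
    """Convert a retention period in hours to milliseconds."""
    return int(hours * 60 * 60 * 1000)


def alter_retention_command(topic: str, hours: float, bootstrap: str) -> list[str]:
    """Build (but do not run) a kafka-configs.sh invocation.

    Assumes a Kafka version whose kafka-configs.sh accepts
    --bootstrap-server for topic configuration changes.
    """
    return [
        "kafka-configs.sh", "--alter",
        "--bootstrap-server", bootstrap,
        "--entity-type", "topics",
        "--entity-name", topic,
        "--add-config", f"retention.ms={retention_ms(hours)}",
    ]


print(retention_ms(168))  # the seven-day default: 604800000
print(" ".join(alter_retention_command("orders", 24, "localhost:9092")))
```

Building the argument list separately makes it easy to review the exact command before handing it to a shell or to subprocess.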