Showing posts with label flume. Show all posts
Showing posts with label flume. Show all posts

May 07, 2022

Top 20 Apache Flume Interview Questions and Answers

                    Flume is a standard, simple, robust, versatile, and extendable tool for ingesting data into Hadoop from a variety of data providers (webservers).
Apache Flume is a dependable and distributed log data collection, aggregation, and distribution system. It's a highly available, dependable service with adjustable recovery methods.
The Flume's main goal is to capture streaming data from various web servers and store it in HDFS. Its architecture is basic and adaptable, based on streaming data flows. It is fault-tolerant and provides a fault tolerance and failure recovery mechanism.

Apache Kafka Interview Questions and Answers
Ques. 1): What does Apache Flume stand for?
Apache Flume is an open source platform for collecting, aggregating, and transferring huge amounts of data from one or more sources to a centralised data source effectively and reliably. Flume's data sources can be customised, so it can injest any type of data, such as log data, event data, network data, social media produced data, email messages, message queues, and so on.

Apache Struts 2 Interview Questions and Answers

Ques. 2): Why Flume?
Apart from collecting logs from distributed systems, it is also capable of performing other use cases. like
It Collects readings from array of sensors
Also, it collects impressions from custom apps for an ad network
Moreover, it collects it readings from network devices in order to monitor their performance.
Also, preserves the reliability, scalability, manageability, and extensibility while it serves maximum number of clients with higher QoS.
Apache Spark Interview Questions and Answers 

Ques. 3): What role does Flume play in big data?
Flume is a dependable distributed service for aggregating and collecting massive amounts of streaming data into HDFS. Most big data analysts utilise Apache Flume to deliver data into Hadoop, Strom, Solr, Kafka, and Spark from various sources such as Twitter, Facebook, and LinkedIn.

Ques. 4): What similarities and differences do Apache Flume and Apache Kafka have?
When it comes to Flume, it uses Sinks to send messages to their destinations. However, with Kafka, you must use a Kafka Consumer API to accept messages from the Kafka Broker.

Apache Tomcat Interview Questions and Answers
Ques. 5): What is flume agent, exactly?
A Flume agent is a Java virtual machine (JVM) process that hosts the components that allow events to flow from an external source to the central repository or to the next destination.
For each flume data flow, the Flume agent connects the external sources, Flume sources, Flume Channels, Flume sinks, and external destinations. Flume agent accomplishes this by mapping sources, channels, sinks, and other components, as well as defining characteristics for each component, in a configuration file.

Apache Drill Interview Questions and Answers
Ques. 6): How do you deal with agent errors?
If a Flume agent fails, all flows hosted on that agent are terminated.
Flow will resume once the agent is restarted. All events stored in the chavvels when the agent went down are lost if the channel is set up as an in-memory channel. Channels configured as file or other stable channels, on the other hand, will continue to handle events where they left off.

Apache Ambari interview Questions and Answers
Ques. 7): In Flume, how is recoverability ensured?
Flume organises events and data into channels. Flume sources populate Flume channels with events. Flume sinks consume channel events and publish them to terminal data storage. Failure recovery is handled by channels. Flume supports a variety of channels. In-memory channels save events in an in-memory queue for speedier processing. The local file system backs up file channels, making them durable.

Apache Tapestry Interview Questions and Answers
Ques. 8): What are the Flume's Basic Characteristics?
A Hadoop data gathering service: We can quickly pull data from numerous servers into Hadoop using Flume. For distributed systems, use the following formula: Flume is also used to import massive amounts of event data from social networking sites such as Facebook and Twitter, as well as e-commerce sites such as Amazon and Flipkart. Source code: It is an open-source programme. It can be activated without the use of a licence key. Flume may be resized vertically and horizontally.
1. A flume transports data from sources to sinks. This data collection might be planned or event-driven. Flume features its own query processing engine, which makes it simple to alter each fresh batch of data before sending it to its destination.
2. Apache Flume is horizontally scalable.
3. Apache Flume provides support for large sets of sources, channels, and sinks.
4. With Flume, we can collect data from different web servers in real-time as well as in batch mode.
5. Flume provides the feature of contextual routing.
6. If the read rate exceeds the write rate, Flume provides a steady flow of data between read and write operations.

Apache Ant Interview Questions and Answers
Ques. 9): What exactly is the Flume event?
Flume event is a data unit containing a set of string properties. The source receives events from an external source, such as a web server. Flume contains built-in capabilities to recognise the source format. Avro, for example, delivers events to the Flume from Avro sources.
Each log file is treated as an individual event. Each event has header and value sectors, which contain header information as well as the proper value for each header.

Apache Camel Interview Questions and Answers
Ques. 10): In Flume, explain the replication and multiplexing selections.
Answer: Channel selectors are used to handle many channels. Furthermore, based on the Flume header value, an event can be written to a single channel or numerous channels. If no channel selector is supplied for the source, it defaults to the Replicating selector.

Apache Cassandra Interview Questions and Answers
Ques. 11): What exactly is FlumeNG?
FlumeNG is nothing more than a real-time loader for streaming data into Hadoop. It basically uses HDFS and HBase to store data. As a result, if we wish to start with FlumeNG, we should know that it improves on the original flume.
Using the replicating selection, the same event is written to all of the channels in the source's channels list. We use the Multiplexing channel selection when the application has to broadcast distinct events to multiple channels.

Apache NiFi Interview Questions and Answers
Ques. 12): Could you please clarify what configuration files are?
The configuration of the agent is saved in a local configuration file. It contains information about each agent's source, sink, and channel. Name, type, and set of properties are all properties of each fundamental component, such as source, sink, and channel. To accept data from an external client, an Avro source, for example, requires the hostname and port number. In terms of capacity, the memory channel should have a maximum queue size. Sink should have File System URI, Path to Create Files, File Rotation Frequency, and other settings.

Apache Storm Interview Questions and Answers
Ques. 13): What is topology design in Apache Flume?
The initial step in Apache Flume is to verify all data sources and sinks, after which we may determine whether we need event aggregation or rerouting. When gathering data from multiple sources, aggregation and rerouting are required to redirect those events to a different place.

Ques. 14): Explain about the core components of Flume.
The core components of Flume are –
Event- The single log entry or unit of data that is transported.
Source- This is the component through which data enters Flume workflows.
Sink-It is responsible for transporting data to the desired destination.
Channel- it is the duct between the Sink and Source.
Agent- Any JVM that runs Flume.
Client- The component that transmits event to the source that operates with the agent.

Ques. 15): What is the data flow in Flume?
To transport log data into HDFS, we use the Flume framework. The log servers, on the other hand, generate events and log data. Flume agents are also running on these servers. Furthermore, the data generators provide the data to these agents.
To be more explicit, there is an intermediate node in Flume that collects data from these agents; these nodes are referred to as Collectors. There can be several collectors in Flume, just like there can be multiple agents.
After that, data from all of these collectors will be gathered and transferred to a central location. For example, HBase or HDFS. Refer to the Flume Data Flow diagram below for a better understanding of the Flume Data Flow paradigm.

Ques. 16): How can Flume be used with HBase?
Apache Flume can be used with HBase using one of the two HBase sinks –
HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the novel HBase IPC that was introduced in the version HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBase sink as it can easily make non-blocking calls to HBase.
Working of the HBaseSink –
In HBaseSink, a Flume Event is converted into HBase Increments or Puts. Serializer implements the HBaseEventSerializer which is then instantiated when the sink starts. For every event, sink calls the initialize method in the serializer which then translates the Flume Event into HBase increments and puts to be sent to HBase cluster.
Working of the AsyncHBaseSink-
AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. Sink invokes the setEvent method and then makes calls to the getIncrements and getActions methods just similar to HBase sink. When the sink stops, the cleanUp method is called by the serializer.

Ques. 17): What method is used to stream data from the hard drive?
Ans: The data is "streamed" off the hard disc by keeping the drive's maximum I/O rate for these huge blocks of data constant. The write-once, read-many-times pattern is the most efficient data processing pattern, according to HDFS.

Ques. 18): What distinguishes HBaseSink from AsyncHBaseSink?
To deliver the event to the Hbase system, Apache Flume HBaseSink and AsyncHBaseSink are both employed. The HTable API is used to transfer data to HBase in the case of HBaseSink, while the asynchbase API is used to send stream data to HBase in the case of AsyncHBaseSink. The callbacks are responsible for handling any failures.

Ques. 19): In Hadoop HDFS, what is Flume? How can you tell if your sequence data has been imported into HDFS?
It's another Apache Software Foundation top-level project designed to provide continuous data injection in Hadoop HDFS. The data can be any type of data, however Flume is best suited for handling log data, such as web server log data.

Ques. 20): What is the difference between streaming and HDFS?
Ans: Streaming simply means that you can get a continuous bitrate over a specific threshold when sending data, rather than having it come in bursts or waves. If HDFS is set up for streaming, it will very certainly enable seek, albeit with the added overhead of caching data for a steady stream.

January 03, 2022

Top 20 Apache Kafka Interview Questions and Answers


Apache Kafka is a free and open-source streaming platform. Kafka began as a messaging queue at LinkedIn, but it has since grown into much more. It's a flexible tool for working with data streams that may be used in a wide range of situations. Because Kafka is a distributed system, it can scale up and down as needed. All that's left to do now is expand the cluster with new Kafka nodes (servers).

In a short length of time, Kafka can process a big volume of data. It also has a low latency, allowing for real-time data processing. Despite the fact that Apache Kafka is written in Scala and Java, it may be utilised with a wide range of computer languages.

Apache Hive Interview Questions & Answers

Ques. 1): What exactly do you mean when you say "confluent kafka"? What are the benefits?


Confluent is an Apache Kafka-based data streaming platform that can do more than just publish and subscribe. It can also store and process data within the stream. Confluent Kafka is a more extensive version of Apache Kafka. It improves Kafka's integration capabilities by adding tools for optimising and maintaining Kafka clusters, as well as methods for ensuring the security of the streams. Because of the Confluent Platform, Kafka is simple to set up and use. Confluent's software is available in three flavours:

A free, open-source streaming platform that makes working with real-time data streams a breeze;

A premium cloud-based version with more administration, operations, and monitoring features; an enterprise-grade version with more administration, operations, and monitoring tools.

Following are the advantages of Confluent Kafka :

  • It features practically all of Kafka's characteristics, as well as a few extras.
  • It greatly simplifies the administrative operations procedures.
  • It relieves data managers of the burden of thinking about data relaying.

Apache Ambari interview Questions & Answers

Ques. 2): What are some of Kafka's characteristics?


The following are some of Kafka's most notable characteristics:-

  • Kafka is a fault-tolerant messaging system with a high throughput.
  • A Topic is a built-in patriation system in Kafka.
  • Kafka also comes with a replication mechanism.
  • Kafka is a distributed messaging system that can manage massive volumes of data and transfer messages from one sender to another.
  • The messages can also be saved to storage and replicated across the cluster using Kafka.
  • Kafka works with Zookeeper for synchronisation and collaboration with other services.
  • Kafka provides excellent support for Apache Spark.

Apache Tapestry Interview Questions and Answers

Ques. 3): What are some of the real-world usages of Apache Kafka?


The following are some examples of Apache Kafka's real-world applications:

Message Broker: Because Apache Kafka has a high throughput value, it can handle a large number of similar sorts of messages or data. Apache Kafka can be used as a publish-subscribe messaging system that makes it simple to read and publish data.

To keep track of website activity, Apache Kafka can check if data is successfully delivered and received by websites. Apache Kafka is capable of handling the huge volumes of data generated by websites for each page as well as user actions.

To keep track of metrics connected to certain technologies, such as security logs, we can utilise Apache Kafka to monitor operational data.

Data logging: Apache Kafka provides data replication between nodes functionality that can be used to restore data on failed nodes. It can also be used to collect data from various logs and make it available to consumers.

Stream Processing with Kafka: Apache Kafka can also handle streaming data, the data that is read from one topic, processed, and then written to another. Users and applications will have access to a new topic containing the processed data.

Apache NiFi Interview Questions & Answers

Ques. 4): What are some of Kafka's disadvantages?


The following are some of Kafka's drawbacks:

  • When messages are tweaked, Kafka performance suffers. Kafka works well when the message does not need to be updated.
  • Kafka does not support wildcard topic selection. It's crucial to use the appropriate issue name.
  • When dealing with large messages, brokers and consumers degrade Kafka's performance by compressing and decompressing the messages. This has an effect on Kafka's performance and throughput.
  • Kafka does not support several message paradigms, such as point-to-point queues and request/reply.
  • Kafka lacks a comprehensive set of monitoring tools.

Apache Spark Interview Questions & Answers

Ques. 5): What are the use cases of Kafka monitoring?


The following are some examples of Kafka monitoring use cases:

  • Monitor the use of system resources: It can be used to track the usage of system resources like memory, CPU, and disc over time.
  • Threads and JVM consumption should be monitored: To free up memory, Kafka relies on the Java garbage collector, which ensures that it runs frequently, ensuring that the Kafka cluster is more active.
  • Maintain an eye on the broker, controller, and replication statistics so that partition and replica statuses can be changed as needed.
  • Identifying which applications are producing excessive demand and performance bottlenecks may aid in quickly resolving performance issues.


Ques. 6): What is the difference between Kafka and Flume?


Flume's main application is ingesting data into Hadoop. Hadoop's monitoring system, file types, file system, and tools like Morphlines are all incorporated into the Flume. When working with non-relational data sources or streaming a huge file into Hadoop, the Flume is the best option.

Kafka's main use case is as a distributed publish-subscribe messaging system. Kafka was not created with Hadoop in mind, therefore using it to gather and analyse data for Hadoop is significantly more difficult than using Flume.

When a highly reliable and scalable corporate communications system, such as Hadoop, is required, Kafka can be used.


Ques. 7): Explain the terms "leader" and "follower."


In Kafka, each partition has one server that acts as a Leader and one or more servers that operate as Followers. The Leader is in charge of all read and write requests for the partition, while the Followers are responsible for passively replicating the leader. In the case that the Leader fails, one of the Followers will assume leadership. The server's load is balanced as a result of this.


Ques. 8): What are the traditional methods of message transfer? How is Kafka better from them?


The classic techniques of message transmission are as follows: -

Message Queuing: -

The message queuing pattern employs a point-to-point approach. A message in the queue will be discarded once it has been eaten, similar to how a message in the Post Office Protocol is removed from the server once it has been delivered. These queues allow for asynchronous messaging.

If a network difficulty prevents a message from being delivered, such as when a consumer is unavailable, the message will be queued until it is transmitted. As a result, messages aren't always sent in the same order. Instead, they are distributed on a first-come, first-served basis, which in some cases can improve efficiency.

Publisher - Subscriber Model:-

The publish-subscribe pattern entails publishers producing ("publishing") messages in multiple categories and subscribers consuming published messages from the various categories to which they are subscribed. Unlike point-to-point texting, a message is only removed once it has been consumed by all category subscribers.

Kafka caters to a single consumer abstraction, the consumer group, which contains both of the aforementioned. The advantages of adopting Kafka over standard communications transfer mechanisms are as follows:

Scalable: Data is partitioned and streamlined using a cluster of devices, which increases storage capacity.

Faster: A single Kafka broker can handle megabytes of reads and writes per second, allowing it to serve thousands of customers.

Durability and Fault-Tolerant: The data is kept persistent and tolerant to any hardware failures by copying the data in the clusters.


Ques. 9): What is a Replication Tool in Kafka? Explain how to use some of Kafka's replication tools.


The Kafka Replication Tool is used to define the replica management process at a high level. Some of the replication tools available are as follows:

Replica Leader Election Tool of Choice: The Preferred Replica Leader Election Tool distributes partitions to many brokers in a cluster, each of which is known as a replica. The favourite replica is a term used to describe the leader. For various partitions, the brokers generally distribute the leader position fairly across the cluster, but due to failures, planned shutdowns, and other circumstances, an imbalance might develop over time. By reassigning the preferred copies, and hence the leaders, this tool can be utilised to maintain the balance in these instances.

Topics tool: The Kafka topics tool is in charge of all administration operations relating to topics, including:

  • Listing and describing the topics.
  • Topic generation.
  • Modifying Topics.
  • Adding a topic's dividers.
  • Disposing of topics.

Tool to reassign partitions: The replicas assigned to a partition can be changed with this tool. This refers to adding or removing followers from a partition.

StateChangeLogMerger tool: The StateChangeLogMerger tool collects data from brokers in a cluster, formats it into a central log, and aids in the troubleshooting of state change issues. Sometimes there are issues with the election of a leader for a particular partition. This tool can be used to figure out what's causing the issue.

Change topic configuration tool: used to create new configuration choices, modify current configuration options, and delete configuration options.


Ques. 10):  Explain the four core API architecture that Kafka uses.


Following are the four core APIs that Kafka uses:

Producer API:

The Producer API in Kafka allows an application to publish a stream of records to one or more Kafka topics.

Consumer API:

The Kafka Consumer API allows an application to subscribe to one or more Kafka topics. It also allows the programme to handle streams of records generated in connection with such topics.

Streams API: The Kafka Streams API allows an application to process data in Kafka using a stream processing architecture. This API allows an application to take input streams from one or more topics, process them with streams operations, and then generate output streams to send to one or more topics. In this way, the Streams API allows you to turn input streams into output streams.

Connect API:

The Kafka Connector API connects Kafka topics to applications. This opens up possibilities for constructing and managing the operations of producers and consumers, as well as establishing reusable links between these solutions. A connector, for example, may capture all database updates and ensure that they are made available in a Kafka topic.


Ques. 11): Is it possible to utilise Kafka without Zookeeper?


As of version 2.8, Kafka can now be utilised without ZooKeeper. When Kafka 2.8.0 was released in April 2021, we all had the opportunity to check it out without ZooKeeper. This version, however, is not yet ready for production and is missing a few crucial features.

It was not feasible to connect directly to the Kafka broker without using Zookeeper in prior versions. This is because the Zookeeper is unable to fulfil client requests when it is down.


Ques. 12): Explain Kafka's concept of leader and follower.


Each partition in Kafka has one server acting as a Leader and one or more servers acting as Followers. The Leader is in control of the partition's read and write requests, while the Followers are in charge of passively replicating the leader. If the Leader is unable to lead, one of the Followers will take over. As a result, the server's load is balanced.


Ques. 13): In Kafka, what is the function of partitions?


From the standpoint of the Kafka broker, partitions allow a single topic to be partitioned across many servers. This gives you the ability to store more data in a single topic than a single server. If you have three brokers and need to store 10TB of data in a topic, you can create a subject with only one partition and store the entire 10TB on one broker. Another option is to create a three-partitioned topic with 10 TB of data distributed across all brokers. From the consumer's perspective, a partition is a unit of parallelism.


Ques. 14): In Kafka, what do you mean by geo-replication?


Geo-replication is a feature in Kafka that allows you to copy messages from one cluster to a number of other data centres or cloud locations. You can use geo-replication to replicate all of the files and store them all over the world if necessary. Using Kafka's MirrorMaker Tool, we can achieve geo-replication. We can ensure data backup without fail by employing the geo-replication strategy.


Ques. 15): Is Apache Kafka a platform for distributed streaming? What are you going to do with it?


Yes. Apache Kafka is a platform for distributed streaming data. Three critical capabilities are included in a streaming platform:

  • We can easily push records using a distributed streaming infrastructure.
  • It has a large storage capacity and allows us to store a large number of records without difficulty.
  • It assists us in processing records as they arrive.
  • The Kafka technology allows us to do the following:
  • We may create a real-time stream of data pipelines using Apache Kafka to send data between two systems.
  • We could also create a real-time streaming platform that reacts to data.


Ques. 16): What is Apache Kafka Cluster used for?


Apache Kafka Cluster is a messaging system that is used to overcome the challenges of gathering and processing enormous amounts of data. The following are the most important advantages of Apache Kafka Cluster:

We can track web activities using Apache Kafka Cluster by storing/sending events for real-time processes.

We may use this to both alert and report on operational metrics.

We can also use Apache Kafka Cluster to transform data into a common format.

It enables the processing of streaming data to the subjects in real time.

It is currently ruling over some of the most popular programmes such as ActiveMQ, RabbitMQ, AWS, and others due to its outstanding characteristics.


Ques. 17): What is the purpose of the Streams API?


Streams API is an API that allows an application to function as a stream processor, ingesting an input stream from one or more topics and providing an output stream to one or more output topics, as well as effectively changing the input streams to output streams.


Ques. 18): In Kafka, what do you mean by graceful shutdown?


Any broker shutdown or failure will be detected automatically by the Apache cluster. In this case, new leaders will be picked for partitions previously handled by that device. This can occur as a result of a server failure or even when the server is shut down for maintenance or configuration changes. Kafka provides a graceful approach for ending a server rather than killing it when it is shut down on purpose.

When a server is turned off, the following happens:

Kafka guarantees that all of its logs are synced onto a disc to avoid having to perform any log recovery when it is restarted. Purposeful restarts can be sped up since log recovery requires time.

Prior to shutting down, all partitions for which the server is the leader will be moved to the replicas. The leadership transfer will be faster as a result, and the period each partition is inaccessible will be decreased to a few milliseconds.


Ques. 19): In Kafka, what do the terms BufferExhaustedException and OutOfMemoryException mean?


A BufferExhaustedException is thrown when the producer can't assign memory to a record because the buffer is full. If the producer is in non-blocking mode and the pace of production over an extended period of time exceeds the rate at which data is transferred from the buffer, the allocated buffer will be emptied and an exception will be thrown.

An OutOfMemoryException may occur if the consumers send large messages or if the quantity of messages sent increases faster than the rate of downstream processing. As a result, the message queue becomes overburdened, using RAM.


Ques. 20): How will you change the retention time in Kafka at runtime?


A topic's retention time can be configured in Kafka. A topic's default retention time is seven days. While creating a new subject, we can set the retention time. When a topic is generated, the broker's property log.retention.hours are used to set the retention time. When configurations for a currently operating topic need to be modified, must be used.

The right command is determined on the Kafka version in use.

The command to use up to 0.8.2 is --alter.

Use --alter starting with version 0.9.0.