Apache Storm is a free and open-source, Clojure-based distributed stream processing platform. The project was created by Nathan Marz and the BackType team, and it was open-sourced after Twitter acquired BackType. Storm makes it simple to reliably process unbounded streams of data, doing for real-time processing what Hadoop does for batch processing. Storm is simple to use and can be used with a variety of programming languages.
Ques. 1): Where would you put Apache Storm to good use?
Answer:
Storm is used for:
Stream processing: Apache Storm is used to process real-time streams of data and update many databases. Incoming data must be processed at least as fast as it arrives. Apache Storm's Distributed RPC can parallelize an intense query, allowing it to be computed in real time.
Continuous computation: Data streams are processed in real time, and Storm delivers the results to clients. This may mean processing each message as it arrives or processing messages in small batches over a short period of time. Streaming trending topics from Twitter into web browsers is an example of continuous computation.
Real-time analytics: Apache Storm will evaluate and respond to data as it arrives in real time from multiple data sources.
Ques. 2): What exactly is Apache Storm? What are the Storm components?
Answer:
Apache Storm is an open-source distributed real-time computation system for processing big data analytics in real time. Unlike Hadoop, Apache Storm supports real-time processing and can be used with any programming language.

Apache Storm contains the following components:
Nimbus: It functions like Hadoop's JobTracker. It distributes code across the cluster, uploads computations for execution, assigns workers to the cluster, and monitors and reallocates workers as needed.
Zookeeper: It serves as the communication intermediary between Nimbus and the Supervisors in a Storm cluster.
Supervisor: It runs on each worker node, interacts with Nimbus through Zookeeper, and starts and stops worker processes based on the signals it receives from Nimbus.
Ques. 3): Storm's codebase has how many unique layers?
Answer:
Storm's codebase is divided into three layers.
First: Storm was designed from the ground up to be compatible with multiple languages. Nimbus is a Thrift service, and topologies are defined as Thrift structures. Thanks to the use of Thrift, Storm can be used from any language.
Second: Storm specifies all of its interfaces as Java interfaces. So, even though Storm's implementation contains a lot of Clojure, all usage must go through the Java API. This means that Storm's whole feature set is always accessible via Java.
Third:
Storm’s implementation is largely in Clojure. Line-wise, Storm is about half
Java code, half Clojure code. But Clojure is much more expressive, so in
reality, the great majority of the implementation logic is in Clojure.
Ques. 4): What are Storm Topologies and How Do They Work?
Answer:
A Storm topology contains the logic for a real-time application. A topology is analogous to a MapReduce job; a key distinction is that a MapReduce job eventually finishes, whereas a topology runs indefinitely (or until you kill it, of course). A topology is a graph made up of spouts, bolts, and stream groupings. A minimal sketch of building and submitting one appears below.
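To make this concrete, here is a minimal, hypothetical sketch of defining and submitting a topology with Storm's Java API; MySpout and MyBolt are made-up names standing in for your own components:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class MyTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("events", new MySpout());       // hypothetical spout
            builder.setBolt("process", new MyBolt())         // hypothetical bolt
                   .shuffleGrouping("events");

            Config conf = new Config();
            conf.setNumWorkers(2);

            // Unlike a MapReduce job, the topology runs until explicitly killed.
            StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
        }
    }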
Ques. 5): What do you mean when you say "nodes"?
Answer:
There are two types of nodes: the Master Node and the Worker Nodes. The Master Node runs the Nimbus daemon, which assigns work to machines and monitors their performance. Each Worker Node runs a Supervisor daemon, which listens for work assigned to its machine and starts and stops worker processes as directed by Nimbus.
Ques. 6): Explain how Apache Storm processes a message completely.
Answer:
Storm requests a tuple from the Spout by calling the nextTuple method on the Spout. To emit a tuple to one of its output streams, the Spout uses the SpoutOutputCollector provided in the open method. When emitting a tuple, the Spout assigns it a "message id" that is used to identify the tuple later.

The tuple is then sent to the consuming bolts, and Storm takes charge of tracking the tree of messages that is created. Once Storm is certain that a tuple has been fully processed, it calls the ack method on the originating Spout task, passing in the message id that the Spout provided. A hedged sketch of such a spout follows.
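The following is a minimal sketch of a spout that emits anchored tuples, assuming the Storm 2.x Java API; the sentence text and the UUID-based id scheme are illustrative choices, not prescribed by Storm:

    import java.util.Map;
    import java.util.UUID;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // In a real spout you would read from a queue or other source here.
            String messageId = UUID.randomUUID().toString();
            // Emitting with a message id makes Storm track the tuple tree.
            collector.emit(new Values("the cow jumped over the moon"), messageId);
        }

        @Override
        public void ack(Object messageId) {
            // Called once the entire tuple tree has been fully processed.
        }

        @Override
        public void fail(Object messageId) {
            // Called on timeout or explicit failure; typically replay the message.
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }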
Ques. 7): When my topology is being set up, why am I getting a NotSerializableException/IllegalStateException?
Answer:
As part of the Storm lifecycle, the topology is instantiated, serialized to byte format, and stored in ZooKeeper before it is executed.

Serialization will fail at this stage if a spout or bolt in the topology has an initialized unserializable property. If an unserializable field is required, initialize it in the prepare method of the bolt or spout, which is called after the topology is delivered to the worker. A sketch of this pattern is shown below.
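Here is a minimal sketch of the pattern, where DatabaseClient is a hypothetical unserializable class and the connection string is made up:

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class DatabaseBolt extends BaseRichBolt {
        private OutputCollector collector;
        // Not initialized here: a live connection is typically unserializable,
        // and the bolt is serialized before being shipped to the workers.
        private transient DatabaseClient client; // hypothetical client class

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
            // Safe: prepare() runs on the worker, after deserialization.
            this.client = new DatabaseClient("db-host:5432"); // hypothetical
        }

        @Override
        public void execute(Tuple input) {
            client.insert(input.getString(0)); // hypothetical method
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt is a sink and emits nothing.
        }
    }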
Ques. 8): In Storm, how do you kill a topology?
Answer:

    storm kill topology-name [-w wait-time-secs]

This kills the topology with the name topology-name. Storm first deactivates the topology's spouts for the duration of the topology's message timeout, allowing all tuples currently being processed to finish. Storm then shuts down the workers and cleans up their state. The -w flag overrides the length of time Storm waits between deactivation and shutdown.
Ques. 9): What is the best way to write integration tests in Java for an Apache Storm topology?
Answer:
For integration testing, you can use LocalCluster. For ideas, have a look at some of Storm's own integration tests. FeederSpout and FixedTupleSpout are the tools to employ. Using the utilities in the Testing class, a topology in which all spouts implement the CompletableSpout interface can be run to completion. Storm tests can additionally choose to "simulate time," which means the Storm topology will idle until LocalCluster.advanceClusterTime is called. This lets you make assertions, for example, between bolt emits. A hedged sketch follows.
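The following is a hedged sketch using the utilities in org.apache.storm.Testing, assuming the storm-core test APIs; SentenceSpout, SplitSentenceBolt, and the component names are hypothetical stand-ins for your own topology:

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.ILocalCluster;
    import org.apache.storm.Testing;
    import org.apache.storm.generated.StormTopology;
    import org.apache.storm.testing.CompleteTopologyParam;
    import org.apache.storm.testing.MkClusterParam;
    import org.apache.storm.testing.MockedSources;
    import org.apache.storm.testing.TestJob;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Values;

    public class WordCountTopologyTest {

        public void testWordCount() {
            MkClusterParam clusterParam = new MkClusterParam();
            clusterParam.setSupervisors(1);

            Testing.withSimulatedTimeLocalCluster(clusterParam, new TestJob() {
                @Override
                public void run(ILocalCluster cluster) throws Exception {
                    // Replace the spout's live feed with fixed test tuples.
                    MockedSources mocked = new MockedSources();
                    mocked.addMockData("sentences", new Values("storm kafka"));

                    CompleteTopologyParam topoParam = new CompleteTopologyParam();
                    topoParam.setMockedSources(mocked);
                    topoParam.setStormConf(new Config());

                    // Hypothetical topology under test.
                    TopologyBuilder builder = new TopologyBuilder();
                    builder.setSpout("sentences", new SentenceSpout());
                    builder.setBolt("split", new SplitSentenceBolt())
                           .shuffleGrouping("sentences");
                    StormTopology topology = builder.createTopology();

                    // Runs the topology to completion, capturing all emitted tuples.
                    Map result = Testing.completeTopology(cluster, topology, topoParam);

                    // Assert on the bolt's output, ignoring ordering.
                    assert Testing.multiseteq(
                            new Values(new Values("storm"), new Values("kafka")),
                            Testing.readTuples(result, "split"));
                }
            });
        }
    }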
Ques. 10): What's the difference between Apache Kafka and Apache Storm, and why should you care?
Answer:
Apache Kafka: Apache Kafka is a distributed and scalable messaging system that can manage massive amounts of data while moving messages from one end-point to another. Kafka is designed to allow a single cluster to serve as the central data backbone of a large organization. It can be expanded elastically and transparently without any downtime. To handle data streams larger than the capacity of any single machine, and clusters of coordinated consumers, data streams are partitioned and spread across a cluster of machines.

Apache Storm, by contrast, is a real-time message processing system that lets users update and manipulate data in real time. Storm often takes data from Kafka and performs the necessary processing. It makes it simple to reliably process unbounded streams of data, doing for real-time processing what Hadoop does for batch processing. Storm is easy to use, works with any programming language, and is a lot of fun.
Ques. 11): When is the cleanup method invoked?
Answer:
The cleanup method is called when a Bolt is being shut down, and it should release any resources that were opened. There is no guarantee that this method will be called on the cluster: for example, if the machine the task is running on crashes, there is no way to invoke it.

The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated) and want to be able to run and kill many topologies without suffering resource leaks. A hedged sketch follows.
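As an illustration, here is a minimal sketch of a bolt that opens a file in prepare and closes it in cleanup; the file path is made up for the example:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class FileWriterBolt extends BaseRichBolt {
        private transient BufferedWriter writer;
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
            try {
                writer = new BufferedWriter(new FileWriter("/tmp/tuples.log", true));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void execute(Tuple input) {
            try {
                writer.write(input.toString());
                writer.newLine();
                collector.ack(input);
            } catch (IOException e) {
                collector.fail(input);
            }
        }

        @Override
        public void cleanup() {
            // Called on shutdown in local mode; not guaranteed on a real cluster.
            try {
                writer.close();
            } catch (IOException e) {
                // Nothing useful to do while shutting down.
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt is a sink and emits nothing.
        }
    }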
Ques. 12): What are the common configurations in Apache Storm?
Answer:
Config.TOPOLOGY_WORKERS: This sets the number of worker processes to use to execute the topology.
Config.TOPOLOGY_ACKER_EXECUTORS: This sets the number of executors that track tuple trees and detect when a spout tuple has been fully processed. If you leave this unset or set it to null, Storm sets the number of acker executors equal to the number of workers configured for the topology.
Config.TOPOLOGY_MAX_SPOUT_PENDING: This sets the maximum number of spout tuples that can be pending on a single spout task at once.
Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS: This is the maximum amount of time a spout tuple has to be fully processed before it is considered failed.
Config.TOPOLOGY_SERIALIZATIONS: You can register additional serializers with Storm using this config so that you can use custom types within tuples.
These can also be set through helper methods on Config, as in the sketch below.
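A minimal sketch of setting these configurations in code; MyCustomType is a hypothetical class standing in for your own tuple field type:

    import org.apache.storm.Config;

    public class CommonConfig {
        public static Config buildConfig() {
            Config conf = new Config();
            conf.setNumWorkers(4);          // Config.TOPOLOGY_WORKERS
            conf.setNumAckers(4);           // Config.TOPOLOGY_ACKER_EXECUTORS
            conf.setMaxSpoutPending(1000);  // Config.TOPOLOGY_MAX_SPOUT_PENDING
            conf.setMessageTimeoutSecs(30); // Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
            // Config.TOPOLOGY_SERIALIZATIONS; MyCustomType is hypothetical.
            conf.registerSerialization(MyCustomType.class);
            return conf;
        }
    }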
Ques. 13): What are the advantages of using Apache Storm?
Answer:
Using Apache Storm for real-time processing has a number of advantages.
First: Storm is a breeze to use. Its sensible default configurations make it simple to deploy and operate.
Second: it is extremely fast; benchmarks have clocked a single node processing around one million 100-byte messages per second.
Third: Apache Storm is scalable, since a topology can be spread across a large number of machines, and it can be used with any programming language.
Fourth: it can automatically detect failures and restart the workers.
Fifth: one of the most significant benefits of Apache Storm is that it is reliable. It guarantees that each unit of data is processed at least once, and in some cases more than once.
Ques. 14): In Storm, tell us about the stream groupings that are built in.
Answer:
Storm has eight built-in stream groupings (see the sketch after this list):
1. Shuffle Grouping: Tuples are distributed randomly across the bolt's tasks so that each task receives an approximately equal number of tuples.
2. Fields Grouping: The stream is partitioned by the fields specified in the grouping, so tuples with the same values for those fields always go to the same task.
3. Partial Key Grouping: Like fields grouping, the stream is partitioned by the fields specified in the grouping, but it load-balances between downstream tasks, giving better resource utilization when the incoming data is skewed.
4. All Grouping: The stream is replicated across all the bolt's tasks, so it should be used with care.
5. Global Grouping: The entire stream goes to a single one of the bolt's tasks (the task with the lowest id).
6. None Grouping: You declare that you don't care how the stream is grouped; currently it behaves the same as shuffle grouping.
7. Direct Grouping: A special kind of grouping in which the producer of a tuple decides which task of the consumer will receive it.
8. Local or Shuffle Grouping: If the target bolt has one or more tasks in the same worker process, tuples are shuffled to those in-process tasks; otherwise it behaves like an ordinary shuffle grouping.
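A minimal sketch wiring a few of these groupings with TopologyBuilder; SentenceSpout, SplitSentenceBolt, WordCountBolt, and ReportBolt are hypothetical components:

    import org.apache.storm.generated.StormTopology;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class GroupingExamples {
        public static StormTopology buildTopology() {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 2);

            // Shuffle grouping: tuples distributed randomly and evenly across tasks.
            builder.setBolt("split", new SplitSentenceBolt(), 4)
                   .shuffleGrouping("sentences");

            // Fields grouping: the same word always reaches the same counter task.
            builder.setBolt("count", new WordCountBolt(), 4)
                   .fieldsGrouping("split", new Fields("word"));

            // Global grouping: the entire stream goes to one task of the reporter.
            builder.setBolt("report", new ReportBolt(), 1)
                   .globalGrouping("count");

            return builder.createTopology();
        }
    }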
Ques. 15): What role does real-time analytics play?
Answer:
Real-time analytics is critical, and the need for it is rapidly increasing. Applications are expected to deliver answers and insights in real time. It covers a wide range of industries, including retail, telecommunications, and finance. In the banking industry, many frauds are reported, and fraudulent transactions are among the most common; such frauds occur often, and real-time analytics can help detect and identify them. It also has a place on social media sites such as Twitter, where the most popular topics are surfaced to users. Real-time analytics plays a part in attracting visitors and generating revenue.
Ques. 16): Why is SSL not included in Apache?
Answer:
SSL is not included in Apache for some significant reasons. Some governments do not allow the import, export, or use of the encryption technology required by SSL's data transport. If SSL were included in Apache, it would not be freely distributable because of these legal issues. In addition, some of the SSL technology used for talking to current clients is patented by RSA Data Security, which does not permit its use without a license.
Ques.
17): What is the purpose of the Server Type directive in the Apache server?
Answer:
The server type directive in Apache's server determines whether Apache should
keep everything in one process or spawn as a child process. The server type
directive is not accessible in Apache 2.0, hence it is not discovered. It is,
nevertheless, accessible in Apache 1.3 for background compatibility with Apache
UNIX versions.
Ques. 18): Worker processes, executors, and tasks: what forms a running topology?
Answer:
Storm distinguishes between three main entities used to run a topology in a Storm cluster (see the sketch after this list):
• A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more of the topology's components (spouts or bolts). Within a Storm cluster, a running topology consists of many such processes running on many machines.
• An executor is a thread spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).
• A task performs the actual data processing: each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time.
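A minimal sketch configuring all three levels, modeled on the standard parallelism example from the Storm documentation; BlueSpout and GreenBolt are hypothetical components:

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismExample {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("blue-spout", new BlueSpout(), 2); // 2 executors

            builder.setBolt("green-bolt", new GreenBolt(), 2)   // 2 executors...
                   .setNumTasks(4)                              // ...running 4 tasks
                   .shuffleGrouping("blue-spout");

            Config conf = new Config();
            conf.setNumWorkers(2); // 2 worker processes for the whole topology
            // Submit with StormSubmitter or run in a LocalCluster from here.
        }
    }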
Ques. 19): When a Nimbus or Supervisor daemon dies, what happens?
Answer:
The Nimbus and Supervisor daemons are designed to be fail-fast (the process self-destructs whenever an unexpected situation occurs) and stateless (all state is kept in Zookeeper or on disk).

The Nimbus and Supervisor daemons must be run under supervision using a tool such as daemontools or monit. That way, if the Nimbus or Supervisor daemons die, they restart as if nothing had happened.

Most notably, the death of Nimbus or the Supervisors has little effect on worker processes. In Hadoop, by contrast, if the JobTracker dies, all the running jobs are lost.
Ques. 20): What are some general guidelines you can provide me for tuning Storm+Trident?
Answer:
Make the number of workers a multiple of the number of machines, the parallelism a multiple of the number of workers, and the number of Kafka partitions a multiple of the spout parallelism.
Use one worker per machine per topology.
Begin with fewer, larger aggregators, one per machine that has workers.
Use the isolation scheduler.
Use one acker per worker; version 0.9 does this by default, but earlier versions do not.
Enable GC logging; if everything is in order, you should see only a few major GCs.
Set the Trident batch interval (in milliseconds) to around 50% of your average end-to-end latency.
Start with a modest max spout pending (one for Trident, or the number of executors for Storm) and gradually increase it until the flow stops changing. You'll probably end up near 2 x (throughput in recs/sec) x (end-to-end latency), roughly twice the capacity suggested by Little's law.
A sketch of a few of these settings in code follows.
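The following is a hedged sketch of a few of these settings; the topology.trident.batch.interval.millis key is the Trident batch interval config as I understand it, and the 200 ms end-to-end latency implied by the 100 ms batch interval is an assumed figure for illustration:

    import org.apache.storm.Config;

    public class TridentTuning {
        public static Config tunedConfig() {
            Config conf = new Config();
            conf.setNumWorkers(4);      // one worker per machine on a 4-machine topology
            conf.setMaxSpoutPending(1); // start small for Trident, then raise gradually
            // Trident batch interval: ~50% of average end-to-end latency.
            conf.put("topology.trident.batch.interval.millis", 100);
            return conf;
        }
    }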