April 15, 2022

Top 20 Apache Storm Interview Questions and Answers


Apache Storm is a Clojure-based distributed stream processing platform that is free and open-source. The project was founded by Nathan Marz and the BackType team, and it was open-sourced after Twitter acquired BackType. Storm makes it simple to reliably process unbounded streams of data, doing real-time processing where Hadoop does batch processing. Storm is simple to use and can be used with many different programming languages.


Ques. 1): Where would you put Apache Storm to good use?


Storm is used for:

Stream processing: Apache Storm processes real-time streams of data and updates many databases. The processing rate must keep up with the rate of the incoming data.

Distributed RPC: Apache Storm's Distributed RPC can parallelize an intense query so that it is computed in real time.

Continuous computation: Data streams are processed in real time, and Storm presents the results to clients. This may mean processing each message as it arrives, or processing small batches over a short period of time. Streaming trending topics from Twitter into web browsers is an example of continuous computation.

Real-time analytics: Apache Storm evaluates and responds to data as it arrives in real time from multiple data sources.



Ques. 2): What exactly is Apache Storm? What are the Storm components?


Apache Storm is an open-source distributed real-time computation system for processing big data analytics in real time. Unlike Hadoop, Apache Storm supports real-time processing and may be utilised with any programming language.

Apache Storm contains the following components:

Nimbus: It functions like Hadoop's JobTracker. It distributes code across the cluster, uploads computations for execution, assigns workers to the cluster, and monitors and reallocates workers as needed.

Zookeeper: It serves as the coordination medium between Nimbus and the Supervisors, which communicate through it rather than directly.

Supervisor: It runs on each worker node, interacts with Nimbus through Zookeeper, and starts and stops worker processes based on the signals it receives from Nimbus.



Ques. 3): Storm's Codebase has how many unique layers?


Storm's codebase has three distinct layers.

First: Storm was designed from the very beginning to be compatible with multiple languages. Nimbus is a Thrift service, and topologies are defined as Thrift structures. Thanks to the use of Thrift, Storm can be used from any language.

Second: Storm specifies all of its interfaces as Java interfaces. So, despite the fact that Storm's implementation contains a lot of Clojure, all usage must go through the Java API. This means that Storm's whole feature set is always accessible via Java.

Third: Storm’s implementation is largely in Clojure. Line-wise, Storm is about half Java code, half Clojure code. But Clojure is much more expressive, so in reality, the great majority of the implementation logic is in Clojure.


Ques. 4): What are Storm Topologies and How Do They Work?


A Storm topology contains the logic for a real-time application. A topology is analogous to a MapReduce job; the key distinction is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts connected with stream groupings.
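The spout-to-bolt flow of a topology can be sketched in plain Java without the Storm API. The class and method names below are illustrative stand-ins, not Storm interfaces: a "spout" emits sentence tuples, a "split bolt" turns each sentence into word tuples, and a "count bolt" keeps running counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the spout -> bolt -> bolt flow of a Storm topology.
// These names are illustrative, not part of the Storm API.
public class MiniTopology {

    // "Spout": emits an unbounded stream; here, a fixed list of sentences.
    static List<String> sentenceSpout() {
        return List.of("the cow jumped", "the moon rose");
    }

    // "Split bolt": one sentence tuple in, several word tuples out.
    static List<String> splitBolt(List<String> sentences) {
        List<String> words = new ArrayList<>();
        for (String s : sentences) {
            for (String w : s.split(" ")) words.add(w);
        }
        return words;
    }

    // "Count bolt": accumulates a running count per word.
    static Map<String, Integer> countBolt(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countBolt(splitBolt(sentenceSpout()));
        System.out.println(counts.get("the")); // 2
    }
}
```

In a real topology, each of these stages would be a separate spout or bolt class wired together with stream groupings, running as many parallel tasks.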


Ques. 5): What do you mean when you say "nodes"?


The Master Node and the Worker Node are the two types of nodes. The Master Node runs the Nimbus daemon, which assigns work to machines and monitors their performance. Each Worker Node runs a Supervisor daemon, which listens for the work assigned to its machine and starts and stops worker processes as Nimbus directs.


Ques. 6): Explain how Apache Storm processes a message completely.


Storm requests a tuple from the Spout by calling the nextTuple method on the Spout. To emit a tuple to one of its output streams, the Spout uses the SpoutOutputCollector provided in the open method. The Spout tags each tuple it emits with a "message id", which is used to identify the tuple later.

The tuple is then sent to the consuming bolts, and Storm takes charge of tracking the tree of messages that is created. Once Storm is certain that a tuple has been fully processed, it calls the ack method on the originating Spout task, passing in the message id that the Spout provided to Storm.


Ques. 7): When my topology is being set up, why am I getting a NotSerializableException/IllegalStateException?


As part of the Storm lifecycle, the topology is instantiated, serialised to byte format, and stored in ZooKeeper before it is executed.

Serialization will fail at this stage if a spout or bolt in the topology has an initialised unserializable field. If an unserializable field is required, initialise it in the prepare method of the bolt or spout instead, which runs after the topology is delivered to the worker.
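The failure mode can be reproduced with plain Java serialization. BrokenBolt, FixedBolt, and Connection below are hypothetical stand-ins for a real bolt and its unserializable resource (a database client, socket, etc.); the transient-field-plus-prepare pattern mirrors the advice above.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Demonstrates why an initialised unserializable field breaks topology
// submission: the whole spout/bolt object must survive Java serialization.
// BrokenBolt eagerly creates a field whose class is not Serializable;
// FixedBolt defers creation to prepare(), which runs on the worker.
public class SerializationDemo {
    static class Connection { }  // stand-in for a DB client, socket, etc.

    static class BrokenBolt implements Serializable {
        Connection conn = new Connection();  // fails at serialization time
    }

    static class FixedBolt implements Serializable {
        transient Connection conn;                   // never serialized
        void prepare() { conn = new Connection(); }  // initialised on worker
    }

    static boolean serializes(Object o) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o);
            return true;
        } catch (Exception e) {
            return false;  // java.io.NotSerializableException lands here
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new BrokenBolt())); // false
        System.out.println(serializes(new FixedBolt()));  // true
    }
}
```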


Ques. 8): In Storm, how do you kill a topology?


storm kill topology-name [-w wait-time-secs]

This kills the topology named topology-name. Storm first deactivates the topology's spouts for the duration of the topology's message timeout, allowing all tuples currently being processed to finish. Storm then shuts down the workers and cleans up their state. The -w flag lets you override the amount of time Storm waits between deactivation and shutdown.


Ques. 9): What is the best way to write integration tests in Java for an Apache Storm topology?


For integration testing, you can use LocalCluster. For ideas, have a look at some of Storm's own integration tests. The tools you should employ are FeederSpout and FixedTupleSpout. Using the tools in the Testing class, a topology in which all spouts implement the CompletableSpout interface can be run to completion. Storm tests can additionally choose to "simulate time", which means the Storm topology stays idle until LocalCluster.advanceClusterTime is called. This lets you make assertions, for example, in between bolt emits.


Ques. 10): What's the difference between Apache Kafka and Apache Storm, and why should you care?


Apache Kafka: Apache Kafka is a distributed and scalable messaging system that can manage massive amounts of data while allowing messages to flow from one end-point to another. Kafka is designed to allow a single cluster to serve as the central data backbone of a huge organization. It can be expanded elastically and transparently without any downtime. Data streams are partitioned and spread over a cluster of machines to support streams larger than the capability of any single machine and to allow clusters of coordinated consumers.


Apache Storm: Apache Storm is a real-time message processing system that allows users to update and manipulate data in real time. Storm pulls the data from Kafka and applies the required processing. It makes it simple to reliably process unbounded streams of data, doing real-time processing the same way Hadoop does batch processing. Storm is easy to use, works with any programming language, and is a lot of fun to operate.


Ques. 11): When do you invoke the cleanup procedure?


The cleanup method is called when a Bolt is being shut down, and it should release any resources that were opened. There is no guarantee that this method will be called on the cluster: for example, if the machine the task is running on crashes, there is no way to invoke the method.

The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated) and you want to be able to run and kill many topologies without suffering any resource leaks.
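A minimal sketch of that local-mode pattern, with illustrative names rather than Storm's actual IBolt interface: a resource is opened in prepare() and released in cleanup(), so repeatedly running and killing topologies does not leak.

```java
// Sketch of the prepare/cleanup lifecycle described above. FileBolt and
// its methods are illustrative stand-ins, not the Storm bolt interface.
public class CleanupSketch {
    static class FileBolt {
        boolean resourceOpen;
        void prepare() { resourceOpen = true; }   // e.g. open a file handle
        void cleanup() { resourceOpen = false; }  // release it on shutdown
    }

    public static void main(String[] args) {
        FileBolt bolt = new FileBolt();
        bolt.prepare();                        // called when the bolt starts
        bolt.cleanup();                        // called on local-mode shutdown
        System.out.println(bolt.resourceOpen); // false: nothing leaked
    }
}
```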


Ques. 12): What are the common configurations in Apache Storm?


Config.TOPOLOGY_WORKERS: This sets the number of worker processes to use to execute the topology.

Config.TOPOLOGY_ACKER_EXECUTORS: This sets the number of executors that will track tuple trees and detect when a spout tuple has been fully processed. By not setting this variable, or setting it to null, Storm will set the number of acker executors equal to the number of workers configured for the topology.

Config.TOPOLOGY_MAX_SPOUT_PENDING: This sets the maximum number of spout tuples that can be pending on a single spout task at once.

Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS: This is the maximum amount of time a spout tuple has to be fully processed before it is considered failed.

Config.TOPOLOGY_SERIALIZATIONS: You can register more serializers with Storm using this config, so that you can use custom types within tuples.
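In real code these constants live on Storm's Config class, which is itself a HashMap subclass; a plain map with the corresponding storm.yaml key names sketches the shape. The values below are illustrative, not recommended settings.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a topology configuration using plain map entries with the
// storm.yaml key names that the Config.TOPOLOGY_* constants refer to.
// Values are illustrative, not recommendations.
public class TopologyConfigSketch {
    static Map<String, Object> build() {
        Map<String, Object> conf = new HashMap<>();
        conf.put("topology.workers", 4);
        conf.put("topology.acker.executors", null); // null => one per worker
        conf.put("topology.max.spout.pending", 1000);
        conf.put("topology.message.timeout.secs", 30);
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(build().get("topology.workers")); // 4
    }
}
```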


Ques. 13):  What are the advantages of using Apache Storm?


Using Apache Storm for real-time processing has a number of advantages.

First: Storm is a breeze to use. Its sensible default configurations make it simple to deploy and use.

Second: it is very fast; a benchmark clocked it at over a million tuples processed per second per node.

Third: Apache Storm is scalable. It can be deployed across a large number of machines, and throughput can be increased by adding nodes to the cluster.

Fourth: it is fault-tolerant. It automatically detects failures and restarts the workers.

Fifth: one of the most significant benefits of Apache Storm is that it is reliable. It guarantees that each unit of data will be processed at least once, and in some cases more than once.


Ques. 14): In Storm, tell us about the stream groups that are built-in?


Storm has eight built-in stream groupings. These are:

1. Shuffle Grouping: Tuples are distributed randomly across the bolt's tasks so that each task receives an equal number of tuples.

2. Fields Grouping: The stream is partitioned by the fields specified in the grouping, so tuples with the same values for those fields always go to the same task.

3. Partial Key Grouping: Similar to fields grouping in that the stream is partitioned by the fields specified in the grouping, but it load-balances between two downstream bolts, which gives better utilization of resources when the incoming data is skewed.

4. All Grouping: The stream is replicated across all the bolt's tasks, so this grouping should be used with care.

5. Global Grouping: The entire stream goes to a single one of the bolt's tasks (specifically, the task with the lowest id).

6. None Grouping: You declare that you do not care how the stream is grouped; currently, this behaves the same as shuffle grouping.

7. Direct Grouping: A special kind of grouping in which the producer of a tuple decides which task of the consumer will receive it.

8. Local or Shuffle Grouping: If the target bolt has one or more tasks in the same worker process, tuples are shuffled to just those in-process tasks; otherwise, it acts like an ordinary shuffle grouping.
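As a sketch of how a fields grouping keeps equal field values on the same task, hash the grouping field modulo the number of bolt tasks. The hashing here is illustrative; Storm's actual partitioner differs in implementation detail, but the routing property is the same.

```java
// Sketch of fields-grouping routing: hashing the grouping field modulo
// the task count guarantees that tuples with the same field value always
// land on the same bolt task. Illustrative only; not Storm's partitioner.
public class FieldsGroupingSketch {
    static int targetTask(String fieldValue, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hashes
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same word always routes to the same task:
        System.out.println(
            targetTask("storm", tasks) == targetTask("storm", tasks)); // true
    }
}
```

This is exactly why fields grouping is the right choice for word counting: every occurrence of a given word reaches the same count-bolt task.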


Ques. 15): What role does real-time analytics play?


Real-time analytics is critical, and the need for it is rapidly increasing. Applications use it to deliver quick answers built on real-time insights. It covers a wide range of industries, including retail, telecommunications, and finance. In the banking industry, many frauds are reported, and fraudulent transactions are among the most common. Such frauds happen quickly, and real-time analytics can assist in detecting and identifying them as they occur. It also has a place on social media sites like Twitter, where the most popular topics are surfaced to users. Real-time analytics plays a part in attracting visitors and generating revenue.


Ques. 16): Why SSL is not included in Apache?


SSL is not included in Apache for several significant reasons. Some governments do not allow the import, export, or use of the encryption technology required by SSL data transport. If SSL were included in Apache, it could not be distributed freely because of these legal restrictions. In addition, some of the SSL technology used for talking to current clients is patented by RSA Data Security, which does not allow its use without a license.


Ques. 17): What is the purpose of the Server Type directive in the Apache server?

The ServerType directive in Apache determines whether Apache should keep everything in one process or spawn child processes. The ServerType directive is not available in Apache 2.0, so it will not be found there. It is, however, available in Apache 1.3 for backward compatibility with UNIX-based versions of Apache.


Ques. 18): Worker processes, executors, and tasks: what forms a running topology?


Storm distinguishes between three main entities that are used to run a topology in a Storm cluster:

·         A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more of that topology's components (spouts or bolts). Within a Storm cluster, a running topology consists of many such processes running on many machines.

·         An executor is a thread spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

·         A task performs the actual data processing: each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time.
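The relationship between tasks and executors can be illustrated with a toy calculation (numbers are made up): since a component's task count is fixed but its executor count can change, each executor simply runs an even share of the tasks.

```java
// Illustrates the task/executor arithmetic described above: a component's
// tasks are spread evenly over its executors, and Storm requires at least
// as many tasks as executors. Numbers below are illustrative.
public class ParallelismSketch {
    static int tasksPerExecutor(int numTasks, int numExecutors) {
        return numTasks / numExecutors;
    }

    public static void main(String[] args) {
        // A bolt configured with 4 tasks running on 2 executors (threads):
        System.out.println(tasksPerExecutor(4, 2)); // 2 tasks per thread
        // After rebalancing the same bolt down to 1 executor:
        System.out.println(tasksPerExecutor(4, 1)); // 4 tasks per thread
    }
}
```

This is what makes rebalancing possible: the fixed task count gives fields groupings a stable partitioning while threads are added or removed.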


Ques. 19): When a Nimbus or Supervisor daemon dies, what happens?


The Nimbus and Supervisor daemons are designed to be stateless and fail-fast (the process self-destructs whenever an unexpected situation occurs); all state is kept in Zookeeper or on disk.

The Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or monit. As a result, if the Nimbus or Supervisor daemons die, they restart as though nothing had happened.

Most notably, the death of Nimbus or the Supervisors has no effect on worker processes. In Hadoop, by contrast, if the JobTracker dies, all the running jobs are lost.


Ques. 20): What are some general guidelines you can provide me for customising Storm+Trident?


Make the number of workers a multiple of the number of machines, the parallelism a multiple of the number of workers, and the number of Kafka partitions a multiple of the spout parallelism.

Use one worker per machine per topology.

Begin with fewer, larger aggregators, one per machine with workers.

Make use of the isolation scheduler.

Use one acker per worker; this is the default in version 0.9, but earlier versions do not default to it, so set it explicitly.

Enable GC logging; if everything is in order, you should see very few major GCs.

Set the Trident batch millis to about 50% of your average end-to-end latency.

Start with a small max spout pending (one for Trident, or the number of executors for Storm) and increase it until the flow stops changing. You will probably end up near 2 x (throughput in recs/sec) x (end-to-end latency), roughly twice the in-flight capacity predicted by Little's law.
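That last rule of thumb is simple arithmetic; a sketch with illustrative numbers:

```java
// Rough sizing for max spout pending using the rule of thumb above:
// pending ~= 2 * throughput (tuples/sec) * end-to-end latency (sec),
// i.e. twice the in-flight population predicted by Little's law.
public class MaxSpoutPendingSketch {
    static long estimate(double tuplesPerSec, double latencySec) {
        return Math.round(2 * tuplesPerSec * latencySec);
    }

    public static void main(String[] args) {
        // 5000 tuples/sec with 200 ms average end-to-end latency:
        System.out.println(estimate(5000, 0.2)); // 2000
    }
}
```

Treat the result as a starting point for tuning, not a final setting; the text above recommends raising the value gradually and watching the flow.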
