April 15, 2022

Top 20 Apache Storm Interview Questions and Answers

     

Apache Storm is a Clojure-based distributed stream processing platform that is free and open source. The project was created by Nathan Marz and the BackType team and was open-sourced after Twitter acquired BackType. Storm makes it simple to reliably process unbounded streams of data, providing real-time processing where Hadoop provides batch processing. Storm is simple to operate and can be used with a variety of programming languages.




Ques. 1): Where would you put Apache Storm to good use?

Answer:

Storm is used for:

Stream processing: Apache Storm is used to process streams of data in real time and update many databases. The input must be processed at least as fast as it arrives.

Distributed RPC: Apache Storm's Distributed RPC can parallelize an intense query so that it is computed in real time.

Continuous computation: Data streams are processed continuously, and Storm presents the results to clients in real time. This may require processing each message as it arrives or processing small batches over a short period. Streaming trending topics from Twitter into web browsers is an example of continuous computation.

Real-time analytics: Apache Storm evaluates and responds to data as it arrives from multiple data sources in real time.

 



Ques. 2): What exactly is Apache Storm? What are the Storm components?

Answer:

Apache Storm is an open-source, distributed real-time computation system for processing big data analytics in real time. Unlike Hadoop, Apache Storm supports real-time processing and can be used with any programming language.

Apache Storm contains the following components:

Nimbus: The master daemon, analogous to Hadoop's JobTracker. It distributes code across the cluster, uploads computations for execution, assigns workers, and monitors and reallocates workers as needed.

Zookeeper: The communication intermediary between Nimbus and the Storm cluster's Supervisors; cluster coordination state is kept here.

Supervisor: Runs on each worker node. It interacts with Nimbus through Zookeeper and starts or stops worker processes based on the signals it receives from Nimbus.

 



Ques. 3): Storm's Codebase has how many unique layers?

Answer:

Storm's codebase is divided into three tiers.

First: Storm was designed from the ground up to be compatible with multiple languages. Nimbus is a Thrift service, and topologies are defined as Thrift structures. Thanks to Thrift, Storm can be used from any language.

Second: Storm specifies all of its interfaces as Java interfaces. So, even though Storm's implementation contains a lot of Clojure, all usage goes through the Java API. This means that Storm's entire feature set is always accessible via Java.

Third: Storm’s implementation is largely in Clojure. Line-wise, Storm is about half Java code, half Clojure code. But Clojure is much more expressive, so in reality, the great majority of the implementation logic is in Clojure.




Ques. 4): What are Storm Topologies and How Do They Work?

Answer:

A Storm topology contains the logic for a real-time application. A topology is analogous to a MapReduce job; the key distinction is that a MapReduce job eventually finishes, whereas a topology runs indefinitely (or until you kill it, of course). A topology is a graph made up of spouts, bolts, and the stream groupings that connect them.
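For illustration, here is a minimal sketch of how a topology is wired together with Storm's Java API. SentenceSpout, SplitBolt, and CountBolt are hypothetical components written by you, not part of Storm itself:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // spouts emit source streams; bolts consume and transform them
        builder.setSpout("sentences", new SentenceSpout(), 2);
        builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences");
        builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("split", new Fields("word"));
        // the topology runs until it is explicitly killed
        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}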




Ques. 5): What do you mean when you say "nodes"?

Answer:

There are two kinds of nodes: the Master Node and the Worker Nodes. The Master Node runs the Nimbus daemon, which assigns work to machines and monitors their performance. Each Worker Node runs the Supervisor daemon, which listens for work assigned to its machine and starts or stops worker processes as needed.




Ques. 6): Explain how Apache Storm processes a message completely.

Answer:

Storm requests a tuple from the Spout by calling the nextTuple method on it. To emit a tuple to one of its output streams, the Spout uses the SpoutOutputCollector provided in the open method. The Spout attaches a "message id" to each tuple it emits, which is used to identify the tuple later.

The tuple is then sent on to the consuming bolts, and Storm takes charge of tracking the tree of messages that is generated. Once Storm is certain that a tuple has been fully processed, it calls the ack method on the originating Spout task, passing in the message id that the Spout supplied to Storm.
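A minimal sketch of a spout that anchors its tuples with message ids so Storm can track them, assuming a Storm 2.x-style API; the emitted sentence is a placeholder:

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import java.util.Map;
import java.util.UUID;

public class ReliableSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        // emitting with a message id makes the tuple's tree trackable
        collector.emit(new Values("hello storm"), UUID.randomUUID().toString());
    }

    public void ack(Object msgId) {
        // called once Storm is sure the tuple tree was fully processed
    }

    public void fail(Object msgId) {
        // called on timeout or explicit failure; replay logic would go here
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}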




Ques. 7): When my topology is being set up, why am I getting a NotSerializableException/IllegalStateException?

Answer:

As part of the Storm lifecycle, the topology is instantiated, serialised to byte format, and stored in ZooKeeper before it is run.

Serialization will fail at this stage if a spout or bolt in the topology has an initialised unserializable field. If an unserializable field is required, initialise it in the prepare method of the bolt (or the open method of the spout), which is called after the topology is delivered to the worker.
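An illustrative sketch of the pattern; the JDBC connection and URL here stand in for any unserializable resource:

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Map;

public class DbWriterBolt extends BaseRichBolt {
    // transient: excluded from topology serialization
    private transient Connection conn;

    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        try {
            // safe to initialise here: prepare runs on the worker, after deserialization
            conn = DriverManager.getConnection("jdbc:postgresql://db.example.com/stats"); // hypothetical URL
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void execute(Tuple input) {
        // ... write input to the database ...
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams
    }
}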




Ques. 8): In Storm, how do you kill a topology?

Answer:

storm kill topology-name [-w wait-time-secs]

This kills the topology named topology-name. Storm first deactivates the topology's spouts for the duration of the topology's message timeout, allowing all messages currently being processed to complete. Storm then shuts down the workers and cleans up their state. The -w flag overrides the length of time Storm waits between deactivation and shutdown.




Ques. 9): What is the best way to write integration tests in Java for an Apache Storm topology?

Answer:

For integration testing, you can use LocalCluster. Look at some of Storm's own integration tests for ideas. FeederSpout and FixedTupleSpout are useful tools. A topology in which all spouts implement the CompletableSpout interface can be run to completion using the tools in the Testing class. Storm tests can also opt to "simulate time," meaning the Storm topology idles until LocalCluster.advanceClusterTime is called. This lets you perform assertions, for example, between bolt emits.
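A minimal sketch of such a test, assuming a Storm 2.x-style API where LocalCluster is AutoCloseable; WordCountBolt is a hypothetical bolt under test:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.FeederSpout;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class WordCountTopologyTest {
    public void runsTopologyLocally() throws Exception {
        FeederSpout spout = new FeederSpout(new Fields("word"));
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", spout);
        builder.setBolt("count", new WordCountBolt()).shuffleGrouping("words");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("test", new Config(), builder.createTopology());
            spout.feed(new Values("storm")); // push a test tuple through the topology
            // ... assert on the bolt's observable side effects here ...
        }
    }
}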




Ques. 10): What's the difference between Apache Kafka and Apache Storm, and why should you care?

Answer:

Apache Kafka: Apache Kafka is a distributed, scalable messaging system that can handle massive volumes of data, moving messages from one end-point to another. Kafka is designed to let a single cluster serve as the central data backbone of a large organization. It can be expanded elastically and transparently with no downtime. Data streams are partitioned and spread across a cluster of machines, which allows streams larger than any single machine can handle and supports coordinated clusters of consumers.

Apache Storm, on the other hand, is a real-time message processing system that lets you update and manipulate data in real time. Storm often pulls its data from Kafka and performs the required processing. It makes it simple to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is easy to use, works with any programming language, and is a lot of fun to operate.




Ques. 11): When do you invoke the cleanup procedure?

Answer:

The cleanup method is called when a Bolt is shut down; it should release any resources that were opened. There is no guarantee that this method will be called on the cluster: if the machine the task is running on crashes, for example, there is no way to invoke it.

The cleanup method is intended for local mode (where a Storm cluster is simulated), so that you can run and kill many topologies without leaking resources.

 

Ques. 12): What are the common configurations in Apache Storm?

Answer:

Config.TOPOLOGY_WORKERS: This sets the number of worker processes used to execute the topology.

Config.TOPOLOGY_ACKER_EXECUTORS: This sets the number of executors that track tuple trees and detect when a spout tuple has been fully processed. If this variable is unset or set to null, Storm makes the number of acker executors equal to the number of workers configured for the topology.

Config.TOPOLOGY_MAX_SPOUT_PENDING: This sets the maximum number of spout tuples that can be pending on a single spout task at once.

Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS: This is the maximum amount of time a spout tuple has to be fully processed before it is considered failed.

Config.TOPOLOGY_SERIALIZATIONS: This lets you register additional serializers with Storm so that custom types can be used within tuples.
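A minimal sketch of how these settings are commonly applied through Config's helper methods; MyCustomType is a hypothetical class and the numbers are illustrative:

import org.apache.storm.Config;

Config conf = new Config();
conf.setNumWorkers(4);                          // Config.TOPOLOGY_WORKERS
conf.setNumAckers(4);                           // Config.TOPOLOGY_ACKER_EXECUTORS
conf.setMaxSpoutPending(1000);                  // Config.TOPOLOGY_MAX_SPOUT_PENDING
conf.setMessageTimeoutSecs(30);                 // Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
conf.registerSerialization(MyCustomType.class); // Config.TOPOLOGY_SERIALIZATIONS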

 

Ques. 13):  What are the advantages of using Apache Storm?

Answer:

Using Apache Storm for real-time processing has a number of advantages.

First: Storm is easy to use. Its sensible default configurations make it simple to deploy and operate.

Second: it is extremely fast; the project's own benchmark clocked over a million tuples processed per second per node.

Third: Apache Storm is scalable: topologies can be spread across a large number of machines, and Storm works with any programming language.

Fourth: it automatically detects failures and restarts the workers.

Fifth: one of Apache Storm's most significant benefits is reliability. It guarantees that each unit of data is processed at least once (and, through Trident, exactly once).

 

Ques. 14): In Storm, tell us about the stream groupings that are built in.

Answer:

Storm has eight built-in stream groupings (see the sketch after this list). These are:

1.      Shuffle Grouping: Tuples are distributed randomly across the bolt's tasks so that each task receives roughly the same number of tuples.

2.      Fields Grouping: The stream is partitioned by the fields specified in the grouping, so tuples with the same field values always go to the same task.

3.      Partial Key Grouping: Like fields grouping, the stream is partitioned by the specified fields, but the load is balanced between two downstream tasks, which gives better resource utilization when the incoming data is skewed.

4.      All Grouping: The stream is replicated to all of the bolt's tasks, so this grouping should be used with care.

5.      Global Grouping: The entire stream goes to a single one of the bolt's tasks (the one with the lowest id).

6.      None Grouping: Used when you don't care how the stream is grouped; currently equivalent to shuffle grouping.

7.      Direct Grouping: A special kind of grouping in which the producer of a tuple decides which task of the consumer will receive it.

8.      Local or Shuffle Grouping: If the target bolt has one or more tasks in the same worker process, tuples are shuffled to those in-process tasks; otherwise this acts like a normal shuffle grouping.
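As a sketch, this is how several of these groupings are declared with Storm's Java API; MySpout, MyBolt, and the field names are placeholders:

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new MySpout());
builder.setBolt("a", new MyBolt(), 4).shuffleGrouping("spout");                         // 1. shuffle
builder.setBolt("b", new MyBolt(), 4).fieldsGrouping("spout", new Fields("user"));      // 2. fields
builder.setBolt("c", new MyBolt(), 4).partialKeyGrouping("spout", new Fields("user"));  // 3. partial key
builder.setBolt("d", new MyBolt(), 4).allGrouping("spout");                             // 4. all
builder.setBolt("e", new MyBolt(), 4).globalGrouping("spout");                          // 5. global
builder.setBolt("f", new MyBolt(), 4).localOrShuffleGrouping("spout");                  // 8. local or shuffle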

 

Ques. 15): What role does real-time analytics play?

Answer:

Real-time analytics is critical, and demand for it is growing rapidly. Applications that use it can deliver quick answers based on real-time insights. It covers a wide range of industries, including retail, telecommunications, and finance. In banking, for example, fraudulent transactions are frequently reported; real-time analytics can help detect and identify such fraud as it occurs. It also has a place on social media sites such as Twitter, where the most popular topics are surfaced to users as they emerge. Real-time analytics plays a part in attracting visitors and generating revenue.

 

Ques. 16): Why SSL is not included in Apache?

Answer:

SSL is not included in Apache for some significant reasons. Some governments restrict the import, export, and use of the encryption technology that SSL requires for data transport. If SSL were included in Apache, the server could not be distributed freely because of these legal issues. In addition, some of the SSL technology used to talk to current clients is patented by RSA Data Security, which does not permit its use without a license.

 

Ques. 17): What is the purpose of the Server Type directive in the Apache server?

Answer: The ServerType directive in the Apache server determines whether Apache should keep everything in one process or spawn child processes. The ServerType directive is not available in Apache 2.0. It is, however, still accepted in Apache 1.3 for backward compatibility with older UNIX versions of Apache.

 

Ques. 18): Worker processes, executors, and tasks: what forms a running topology?

Answer:

Storm differentiates between the three major entities that are utilised to run a topology in a Storm cluster:

·         A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more of that topology's components (spouts or bolts). A running topology consists of many such processes running on many machines within a Storm cluster.

·         An executor is a thread spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

·         A task performs the actual data processing: each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time.
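A minimal sketch of how the three levels are configured; SplitBolt and the "sentences" spout are hypothetical:

import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

Config conf = new Config();
conf.setNumWorkers(2); // two worker processes for the topology

TopologyBuilder builder = new TopologyBuilder();
// 4 executors (threads) for this bolt, running 8 tasks in total,
// i.e. 2 tasks per executor, spread across the 2 workers
builder.setBolt("split", new SplitBolt(), 4)
       .setNumTasks(8)
       .shuffleGrouping("sentences");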

 

Ques. 19): When a Nimbus or Supervisor daemon dies, what happens?

Answer:

The Nimbus and Supervisor daemons are designed to be stateless (all state is kept in ZooKeeper or on disk) and fail-fast (the process self-destructs whenever an unexpected situation is encountered).

The Nimbus and Supervisor daemons must therefore be run under supervision using a tool such as daemontools or monit. If the Nimbus or Supervisor daemons die, they restart as though nothing happened.

Notably, the death of Nimbus or the Supervisors has no effect on running worker processes. In Hadoop, by contrast, if the JobTracker dies, all running jobs are lost.

 

Ques. 20): What are some general guidelines you can provide me for customising Storm+Trident?

Answer:

Make the number of workers a multiple of the number of machines, the parallelism a multiple of the number of workers, and the number of Kafka partitions a multiple of the spout parallelism.

Use one worker per machine per topology.

Begin with fewer, larger aggregators, one per machine that has workers.

Use the isolation scheduler.

Use one acker per worker; version 0.9 makes this the default, but earlier versions do not.

Enable GC logging; if everything is in order, you should see only a few major GCs.

Set the trident batch millis to around 50% of your average end-to-end latency.

Start with a small max spout pending (one for Trident, or the number of executors for plain Storm) and gradually increase it until the flow stops improving. The right value usually lands near 2 * (throughput in recs/sec) * (end-to-end latency), i.e. twice the capacity suggested by Little's law.
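A sketch of where several of these knobs live, assuming Storm's Config class; the numbers are illustrative starting points, not recommendations:

import org.apache.storm.Config;

Config conf = new Config();
conf.setNumWorkers(4);      // one worker per machine per topology
conf.setNumAckers(4);       // one acker per worker
conf.setMaxSpoutPending(1); // start small for Trident, then increase
// Trident batch interval: roughly 50% of average end-to-end latency (1s assumed here)
conf.put("topology.trident.batch.emit.interval.millis", 500);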




Top 20 Google Cloud Computing Interview Questions and Answers

The Google Cloud Computing Platform is a rapidly evolving industry standard, and many organizations run successful applications on it. Organizations hire for a variety of cloud computing roles, including Cloud Computing Manager, Cloud Computing Architect, Module Lead, Cloud Engineer, and Cloud Computing Trainer. Below are the most frequently asked questions and answers in this sector, which will be useful to all candidates.

Google Cloud Platform (GCP) is a set of cloud computing services supplied by Google that run on the same infrastructure as Google's internal products, such as Google Search, Gmail, and YouTube.

Google has added a number of cloud services to the App Engine platform since its launch. Its specialty is offering a platform on which individuals and businesses can build and run software, delivered to those users over the internet.

 

Ques. 1): What do you understand by Cloud Computing?

Answer:

Cloud computing is computing capacity that is delivered entirely from the cloud at all times. It is one of the most significant recent developments in the online services sector, and it relies on the Internet, i.e. the Cloud, for delivery. Cloud computing services are genuinely worldwide, with no regional or border limits.

 

Ques. 2): What is the difference between cloud computing and virtualization?

Answer:

·         Cloud computing is a set of layers that work together to deliver IP-based computing; virtualization is a layer/module within the cloud computing architecture that lets providers supply IaaS (Infrastructure as a Service) on demand.

·         Virtualization is software that creates "isolated" images of your hardware and software on the same machine, allowing multiple operating systems, software stacks, and applications to run on the same physical computer.

 

Ques. 3): Tell us about Google Cloud's multiple tiers.

Answer:

The Google cloud platform is divided into four layers:

1. Infrastructure as a Service (IaaS): This is the foundational layer, which includes hardware and networking.

2. Platform as a Service (PaaS): This is the second layer, which includes both the infrastructure and the resources needed to construct apps.

3. Software as a Service (SaaS): SaaS is the third layer that allows users to access the service provider's numerous cloud products.

4. Business Process Outsourcing (BPO): Although BPO is not a technical solution, it is the final layer. BPO refers to outsourcing to a vendor who handles any issues the end user encounters when using cloud computing services.

 

Ques. 4): What are the most important characteristics of cloud services?

Answer:

Cloud computing and cloud services as a whole provide a slew of capabilities and benefits, including the following:

·         The convenience of being able to access and manage commercial software from anywhere in the world.

·         The ability to build and deploy web applications that can serve multiple customers around the world at the same time, and to centralise all software management tasks in a single web service.

·         The elimination of manual software upgrades, achieved by centralising and automating the update process for all applications installed on the platform.

 

Ques. 5): What is GCP Object Versioning?

Answer:

Object versioning is a way to recover data that has been overwritten or deleted. Enabling it increases storage costs but protects your objects: when you activate object versioning on a GCP bucket, a noncurrent version of the object is retained every time the object is overwritten or deleted. Two properties identify a version of an object: generation, which identifies the version of the object's data, and metageneration, which identifies the version of its metadata.
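As an illustration, enabling versioning on a bucket might look like this with the google-cloud-storage Java client; "my-bucket" is a placeholder name:

import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

Storage storage = StorageOptions.getDefaultInstance().getService();
Bucket bucket = storage.get("my-bucket");
// turn on object versioning; overwritten or deleted objects keep noncurrent versions
bucket.toBuilder().setVersioningEnabled(true).build().update();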

 

Ques. 6): Why is it necessary for businesses to manage their workload?

Answer:

A workload in an organisation can be characterised as a self-contained service with its own code that must be executed. This covers everything from data-intensive workloads to transaction and storage processing, and all of this work is independent of external components.

The following are the primary reasons why businesses should manage their workload.

        To get a sense of how their applications are performing.

        To be able to pinpoint exactly what functions are taking place.

        To get a sense of how much using a particular service will cost them.

 

Ques. 7): What is the relationship between Google Compute Engine and Google App Engine?

Answer:

·         Google Compute Engine and Google App Engine are complementary. Google Compute Engine is an IaaS offering, while Google App Engine is a PaaS offering.

·         Google App Engine typically runs web-based applications, mobile backends, and line-of-business applications. Compute Engine is an excellent alternative when you need more control over the underlying infrastructure; for example, you can use Compute Engine to implement bespoke business logic or to run your own storage system.

 

Ques. 8): What are the main components of the Google Cloud Platform?

Answer:

The Google Cloud Platform (GCP) is made up of a number of components that assist users in various ways. I'm familiar with the following GCP elements:

·         Google Compute Engine

·         Google Cloud Container Engine

·         Google Cloud App Engine

·         Google Cloud Storage

·         Google Cloud Dataflow

·         Google BigQuery Service

·         Google Cloud Job Discovery

·         Google Cloud Endpoints

·         Google Cloud Test Lab

·         Google Cloud Machine Learning Engine

 

Ques. 9): What are the different GCP roles you can explore?

Answer:

Within Google Cloud Platform, there are many positions based on tasks and responsibilities:

        Cloud software engineer: A cloud software engineer is a software developer who focuses on cloud computing systems. This position entails the creation of new systems or the upgrade of current ones.

        Cloud software consultant: This position comprises finding solutions to Google's cloud computing customers' complicated problems.

        Technical programme managers: To oversee the planning, communication, and execution of diverse cloud solutions, you'll require appropriate technical competence in cloud computing.

        Cloud engineering managers: Software engineers hired for this position are responsible for designing and delivering internet-scale solutions and products within the cloud computing infrastructure.

        Cloud engineering support: As a software engineer, you could be in charge of managing cloud computing systems and providing technical help to cloud customers who are having problems.

        Product managers for cloud products: As a product manager, you'd be in charge of overseeing the development of new cloud products from conception to launch.

 

Ques. 10): In Google Cloud Storage, what is a bucket?

Answer:

Buckets are the basic containers in which data is stored. You use buckets to organize your data and to control access to it. Each bucket has a globally unique name and a geographic location where the bucket and its contents are kept. It also has a default storage class, which is applied to objects added to the bucket without a storage class specified. There is no limit on the number of buckets you can create or delete.
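A minimal sketch of creating a bucket with the google-cloud-storage Java client; the bucket name, location, and storage class are illustrative:

import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.BucketInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageClass;
import com.google.cloud.storage.StorageOptions;

Storage storage = StorageOptions.getDefaultInstance().getService();
Bucket bucket = storage.create(
    BucketInfo.newBuilder("my-globally-unique-bucket") // bucket names are globally unique
        .setLocation("US")                             // where the contents are kept
        .setStorageClass(StorageClass.STANDARD)        // default class for new objects
        .build());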

 

Ques. 11): What is Cloud Armor, exactly?

Answer:

Cloud Armor helps protect your infrastructure and applications from DDoS attacks. It works with HTTPS load balancers to shield your infrastructure, and you can configure allow and deny rules for incoming traffic. Cloud Armor's flexible rules language allows you to customise your defences and mitigate multi-vector attacks. It also includes predefined rules to defend against application-aware attacks such as cross-site scripting (XSS) and SQL injection (SQLi). If you run a web application, the allow and deny rules you set up help protect it against SQL injection, DDoS, and other attacks.

 

Ques. 12): In cloud computing, what is load balancing?

Answer:

In a cloud computing context, load balancing is the practice of distributing computing resources and workloads to manage demand. It helps achieve high performance at lower cost by managing workload demands through effective resource allocation. It uses the concepts of scalability and agility to increase resource availability in response to demand, and it is also used to monitor the health of cloud applications. All the major cloud providers, such as AWS, GCP, and Azure, offer this feature.

 

Ques. 13): What is Google BigQuery, and how does it work? What are the advantages of BigQuery for data warehouse administrators?

Answer:

Google BigQuery is a service that replaces the hardware architecture of the traditional data warehouse. It is employed as a data warehouse and thus serves as the central repository for all of an organization's analytical data. In addition, BigQuery organises data tables into units called datasets.

For data warehouse practitioners, BigQuery comes in handy in a number of ways. Here are a few of them:

·         BigQuery dynamically allocates query and storage resources based on demand and usage, so it does not require resources to be provisioned before use.

·         For efficient storage management, BigQuery stores data in its own proprietary columnar format, optimised for its query access patterns, on Google's distributed file system.

·         BigQuery is fully managed and always up to date.

·         BigQuery enables a broad level of backup and disaster recovery.

·         BigQuery engineers manage the service's updates and maintenance entirely, without downtime or performance degradation. Users can easily reverse changes and return to a previous state without having to request a backup recovery.
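For illustration, here is a query against a BigQuery public dataset with the BigQuery Java client; a sketch rather than a complete application:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BigQueryDemo {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
            "SELECT name, SUM(number) AS total "
            + "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
            + "GROUP BY name ORDER BY total DESC LIMIT 10").build();
        // BigQuery allocates query resources on demand; nothing is provisioned up front
        TableResult result = bigquery.query(query);
        result.iterateAll().forEach(row ->
            System.out.println(row.get("name").getStringValue() + ": " + row.get("total").getLongValue()));
    }
}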

 

Ques. 14): What are the primary benefits of utilising Google Cloud Platform?

Answer:

Google Cloud Platform is a platform that connects customers to the greatest cloud services and features available. It is gaining popularity among cloud experts and users due to the benefits it provides.

The following are the key benefits of adopting Google Cloud Platform over other platforms:

·         Compared with other cloud service providers, GCP offers significantly lower pricing.

·         GCP offers better overall performance and service for hosting cloud workloads.

·         Google Cloud is very quick to deliver server and security updates in a timely and efficient manner.

·         The security of the Google Cloud Platform is exemplary; the cloud platform and networks are secured and encrypted with multiple security measures.

 

Ques. 15): What are the different types of service accounts? How are you going to make one?

Answer:

·         Service accounts are used to authorise Google Compute Engine to perform tasks on behalf of the user, giving it access to non-sensitive data and information.

·         These accounts handle the user's authorisation procedure and facilitate authentication from Google Compute Engine to other services. Note that service accounts are not used to access the user's own information.

·         Google offers several types of service accounts, but most users work with one of two:

·         Google Cloud Platform Console service accounts

·         Google Compute Engine service accounts

The user doesn’t need to create a service account manually. It is automatically created by the Compute Engine whenever a new instance is created. Google Compute Engine also specifies the scope of the service account for that particular instance when it is created.

 

Ques. 16): What are the multiple Google Cloud SDK installation options?

Answer:

The Google Cloud SDK can be installed using one of four distinct methods. The user can install Google Cloud Software Development Kit using any of the options below, depending on their needs.

·         Using the Google Cloud SDK from scripts, continuous integration, or continuous deployment: the user can download a versioned archive for a non-interactive installation of a specific version of the Cloud SDK.

·         On Red Hat Enterprise Linux 7/CentOS 7, YUM is used to download the latest released version of the Google Cloud SDK in package format.

·         On Ubuntu/Debian, APT is used to download the latest released version of the Google Cloud SDK in package format.

·         For all other use cases, the user can run the interactive installer to install the latest version of the Google Cloud SDK.

 

Ques. 17): How would you request a higher quota for your project?

Answer:

·         All Google Compute Engine projects come with default quotas for various types of resources; quotas can also be increased on a per-project basis.

·         If you hit the quota limit for a resource and need more, you can request additional quota for specific resources from the IAM Quotas page in the Google Cloud Platform Console, using the Edit Quotas button at the top of that page.

 

Ques. 18): What are your impressions about Google Compute Engine?

Answer:

·         Google Compute Engine is an IaaS offering that provides self-managed, configurable virtual machines hosted on Google's infrastructure. It offers Windows- and Linux-based virtual machines running on KVM, with local and persistent storage options, as well as a REST-based API for control and configuration.

        Google Compute Engine interfaces with other Google Cloud Platform technologies, such as Google App Engine, Google Cloud Storage, and Google BigQuery, to expand its computing capabilities and hence enable more sophisticated and complicated applications.

 

Ques.19): What is the difference between a Project Number and a Project Id?

Answer:

The two elements that can be utilised to identify a project are the project id and the project number. The distinctions between the two are as follows:

The project number is generated automatically when a new project is created, whereas the project id is chosen by the user. The project number is required by many services, while the project id is optional (but it is required for the Compute Engine).

 

Ques. 20): What are BigQuery's benefits for data warehouse administrators?

Answer:

BigQuery is useful for data warehouse practitioners in a variety of ways. Here are several examples:

·         BigQuery allocates query and storage resources dynamically based on demand and usage, so resource provisioning is not required before use.

·         BigQuery stores data in its proprietary columnar format, tuned to its query access patterns, on Google's distributed file system, for effective storage management.

·         BigQuery is a fully managed, up-to-date service: BigQuery engineers handle all of the service's updates and maintenance without any downtime or performance reduction.