
Wednesday, 17 November 2021

Top 20 Apache NiFi Interview Questions & Answers


Ques: 1). Is there a functional overlap between NiFi and Kafka?


This is a very common question, and the two tools are in fact highly complementary. A Kafka broker provides very low latency, especially when a large number of consumers pull from the same topic. However, Kafka is not designed to solve dataflow challenges such as data prioritisation and enrichment. Furthermore, Kafka prefers smaller messages, in the KB to MB range, whereas NiFi is more flexible and can handle messages of any size, up to GB per file or more. NiFi complements Kafka by addressing the dataflow challenges that Kafka is not designed to solve.



Ques: 2). What is Apache NiFi, and how does it work?


Apache NiFi is a dataflow automation and enterprise integration tool that lets you send, receive, route, transform, and modify data as needed, in an automated and configurable way. NiFi can connect many information systems and many types of sources and destinations, such as HTTP, FTP, HDFS, the file system, and various databases.


Ques: 3). Is NiFi a viable alternative to ETL and batch processing?


For certain use cases, NiFi can replace ETL, and it can also be used for batch processing. However, the type of processing or transformation the use case requires should be considered. In NiFi, FlowFiles represent the events, objects, and data that pass through the flow. While NiFi allows you to perform any transformation per FlowFile, you should not use it to join FlowFiles together on a common column or to perform certain kinds of windowing aggregations. Cloudera advises using additional tools in these situations.

The ideal choice in a streaming use case is to have the records sent to one or more Kafka topics using NiFi's record processors. Following Cloudera's acquisition of Eventador, you can then have Flink perform any processing you want on this data (joining streams or doing windowing operations) using Continuous SQL.

In a batch use case, NiFi would be treated as an ELT tool rather than an ETL tool (E = extract, T = transform, L = load). NiFi would collect the various datasets, perform the necessary transformations on each dataset (schema validation, format transformation, data cleansing, and so on), and then send the datasets to a Hive-powered data warehouse. Once the data is there, NiFi could trigger a Hive query to perform the join operation.


Ques: 4). Is NiFi a Master-Server Architecture?


No. Since NiFi 1.0, a zero-master clustering approach has been used, and every node in a NiFi cluster is identical. Apache ZooKeeper coordinates the NiFi cluster: a single node is elected through ZooKeeper to serve as the Cluster Coordinator, and ZooKeeper handles failover automatically. All cluster nodes send heartbeat and status information to the Cluster Coordinator, which is responsible for disconnecting and connecting nodes. Every cluster also has one Primary Node, which is likewise elected through ZooKeeper.


Ques: 5). What is the role of Apache NiFi in Big Data Ecosystem?


The main roles Apache NiFi is suited for in the Big Data ecosystem are:

Data acquisition and delivery.

Transformations of data.

Routing data from different sources to destinations.

Event processing.

End-to-end provenance.

Edge intelligence and bi-directional communication.


Ques: 6). What are the components of a FlowFile?


A FlowFile has two parts:

Content: The content is the stream of bytes being moved from source to destination. Keep in mind that the FlowFile does not hold the data itself; it holds a pointer to the content being processed in the dataflow. The actual content is stored in NiFi's Content Repository.

Attributes: The attributes are key-value pairs associated with the data that serve as the FlowFile's metadata. They are typically used to store values that give context to the data. Examples include the filename, UUID, MIME type, and FlowFile creation time.
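The split between content and attributes can be illustrated with a small Python model. This is only a conceptual sketch, not NiFi's actual implementation: `CONTENT_REPOSITORY`, `store_content`, `create_flowfile`, and `route_by_mime_type` are hypothetical names invented for the illustration. The attribute keys `filename`, `uuid`, and `mime.type` follow NiFi's naming.

```python
from dataclasses import dataclass, field
import uuid

# Hypothetical stand-in for NiFi's Content Repository: claim id -> bytes.
CONTENT_REPOSITORY = {}


def store_content(data: bytes) -> str:
    """Store the bytes and return a claim id that points at them."""
    claim_id = str(uuid.uuid4())
    CONTENT_REPOSITORY[claim_id] = data
    return claim_id


@dataclass
class FlowFile:
    content_claim: str  # pointer to the content, not the bytes themselves
    attributes: dict = field(default_factory=dict)  # key-value metadata


def create_flowfile(data: bytes, filename: str, mime_type: str) -> FlowFile:
    """Write the content once, then describe it with attributes."""
    claim = store_content(data)
    attributes = {
        "uuid": str(uuid.uuid4()),
        "filename": filename,
        "mime.type": mime_type,
    }
    return FlowFile(content_claim=claim, attributes=attributes)


def route_by_mime_type(flowfile: FlowFile) -> str:
    """Make a routing decision from attributes only; the content is never read."""
    if flowfile.attributes.get("mime.type") == "application/json":
        return "json"
    return "other"
```

Because routing looks only at the in-memory attributes, the bytes in the repository stay untouched until a step actually needs them.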


Ques: 7). What exactly is the distinction between MiNiFi and NiFi?


MiNiFi agents are used to collect data from sensors and devices in remote locations. The purpose is to help with the "first mile" of data collection and to acquire data as close to its source as possible.

These devices can include servers, workstations, and laptops, as well as sensors, self-driving cars, factory machinery, and other devices where you want to collect specific data and use MiNiFi's NiFi capabilities to filter, select, and triage the data before sending it to a destination.

The objective of MiNiFi is to manage this entire process at scale with Edge Flow Manager so the Operations or IT teams can deploy different flow definitions and collect any data as the business requires. Here are some details to consider:

To move data around or collect data from well-known external systems like databases, object stores, and so on, NiFi is designed to be centrally situated, usually in a data centre or in the cloud. In a hybrid cloud architecture, NiFi should be viewed as a gateway for moving data back and forth between diverse environments.

MiNiFi runs on or close to the host, does some processing and logic, and sends only the data you care about to external data distribution platforms. Such systems can of course be NiFi, but they can also be MQTT brokers, cloud provider services, and so on. MiNiFi also enables use cases where network bandwidth is limited and the volume of data sent over the network must be reduced.

MiNiFi is available in two flavours: C++ and Java. The MiNiFi C++ option has a small footprint (a few MB of memory and low CPU usage), but offers a limited number of processors. The MiNiFi Java option is a single-node, lightweight version of NiFi that lacks the user interface and clustering capabilities. It does, however, require Java to be available on the host.


Ques: 8). Can we schedule a flow to run automatically, as one would with a coordinator?


Because Apache NiFi is designed around the idea of continuous streaming, processors run continuously by default, unless we choose to schedule a processor to run only periodically, for example on an hourly or daily basis. By design, however, Apache NiFi is not job-oriented. Once we start a processor, it runs all the time.


Ques: 9). What are the main features of NiFi?


The main features of Apache NiFi are:

Highly Configurable: Apache NiFi is highly flexible and lets us decide what kind of configuration we want. For example, some of the trade-offs and options are:

Loss tolerant vs Guaranteed delivery

Low latency vs High throughput

Dynamic prioritization

Flow can be modified at runtime

Back pressure

Designed for extension: We can build our own processors, controller services, and so on.


Secure: SSL, SSH, HTTPS, encrypted content, etc.

Multi-tenant authorization and internal authorization/policy management.


Ques: 10). Is there a NiFi connector for any RDBMS database?


Yes, NiFi includes several processors that can communicate with an RDBMS in various ways. For example, "ExecuteSQL" lets you issue a SQL SELECT statement over a configured JDBC connection to retrieve rows from a database; "QueryDatabaseTable" lets you incrementally fetch from a database table; and "GenerateTableFetch" lets you not only fetch records incrementally, but also fetch against source table partitions.


Ques: 11). What is the best way to expose a REST API for real-time data collection at scale?


Customers use NiFi to expose REST APIs that external sources can use to send data to a destination. HTTP is the most widely used protocol.

If you want to ingest data, use the ListenHTTP processor in NiFi, which you can configure to listen on a given port for HTTP requests and to which you can send any data.

If you want to implement a web service with NiFi, look at the HandleHttpRequest and HandleHttpResponse processors. Used together, they let you receive an HTTP request from an external client and respond with a customised answer or result based on the data in the request. For example, you can use NiFi to give HTTP access to external systems such as an FTP server. With the two processors, the client makes its request over HTTP; when NiFi receives the request, it queries the FTP server to retrieve the file, which is then returned to the client.

NiFi can handle all of these use cases with ease. In this scenario, NiFi would scale horizontally to meet demand, and a load balancer would be placed in front of the NiFi instances to distribute the load across the NiFi nodes in the cluster.
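From the client side, sending data to a ListenHTTP endpoint is an ordinary HTTP POST. The sketch below builds such a request with Python's standard library. The host name and port are placeholders, `build_ingest_request` is a hypothetical helper, and the path assumes the processor's Base Path is left at `contentListener`; adjust all of these to match your own processor configuration.

```python
import urllib.request


def build_ingest_request(host: str, port: int, payload: bytes,
                         base_path: str = "contentListener") -> urllib.request.Request:
    """Build (but do not send) a POST request aimed at a ListenHTTP processor.

    host, port, and base_path are assumptions about how the processor
    (or the load balancer in front of the cluster) is exposed.
    """
    url = f"http://{host}:{port}/{base_path}"
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder endpoint: in a scaled-out setup this would be the load balancer.
request = build_ingest_request("nifi.example.com", 8081, b'{"sensor": "t1", "value": 21.5}')
# urllib.request.urlopen(request) would actually send the data.
```

Pointing clients at the load balancer address rather than at an individual node keeps them unaware of how many NiFi nodes sit behind it.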


Ques: 12). When NiFi pulls data, do the attributes get added to the content (real data)?


You can certainly add attributes to your FlowFiles at any time; that is the whole point of separating metadata from the actual data. A FlowFile represents an object or a message moving through NiFi. Each FlowFile has a piece of content, which is the bytes themselves. You can extract attributes from the content and keep them in memory, and then operate on those in-memory attributes without touching the content. This saves a lot of IO overhead and makes the whole flow management process much more efficient.


Ques: 13). Is it possible for NiFi to link to external sources such as Twitter?


Absolutely. NiFi's architecture is highly extensible, allowing any developer or user to add a data source connector quickly. NiFi 1.0 shipped with more than 170 processors bundled by default, including a Twitter processor, and new processors and extensions are likely to be added in every release.


Ques: 14). What's the difference between NiFi, Flume, and Sqoop?


NiFi supports all of Flume's use cases and includes the Flume processor out of the box.

Sqoop's features are also supported by NiFi. GenerateTableFetch, for example, is a processor that performs incremental and concurrent fetches against source table partitions.

Ultimately, the question is whether we are solving a single, specific use case. If so, any of these tools will do. NiFi's benefits really shine when we consider several use cases being handled at once, together with essential flow management features such as interactive, real-time command and control with full data provenance.


Ques: 15). What happens to data if NiFi goes down?


As data moves through the system, NiFi stores it in the repository. There are three important repositories:

The flowfile repository.

The content repository.

The provenance repository.

As a processor writes data to a FlowFile, that data is streamed directly to the content repository. When the processor finishes, it commits the session. This updates the provenance repository with the events that occurred for that processor and updates the FlowFile repository to keep track of where the FlowFile is in the flow. Finally, the FlowFile can be moved to the next queue in the flow.

This way, NiFi can pick up where it left off if it goes down at any point. This, however, overlooks one detail: when we update the repositories, we write the data to the repository, but by default the OS often caches these writes. If the OS goes down together with NiFi, the cached data may be lost. If we really want to avoid this caching, we can configure the repositories to always sync to disk. This, however, can significantly hinder performance. If only NiFi goes down, the data is not affected, because the OS is still responsible for flushing the cached data to disk.
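The trade-off between letting the OS cache writes and always syncing to disk can be shown with plain file I/O. The sketch below is not NiFi code; `append_record` and the `journal.bin` file are hypothetical and only illustrate what "always sync" means at the operating system level.

```python
import os
import tempfile


def append_record(path: str, record: bytes, always_sync: bool) -> None:
    """Append a record to a file, optionally forcing it onto the disk.

    Without the sync, the bytes may sit in the OS page cache for a while:
    they survive a crash of this process, but not a crash of the OS itself.
    """
    with open(path, "ab") as f:
        f.write(record)
        if always_sync:
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to write its cache to disk


with tempfile.TemporaryDirectory() as directory:
    journal = os.path.join(directory, "journal.bin")
    append_record(journal, b"event-1\n", always_sync=True)   # durable, slower
    append_record(journal, b"event-2\n", always_sync=False)  # faster, OS-cached
    with open(journal, "rb") as f:
        contents = f.read()
```

Both records are readable afterwards either way; the difference only shows up if the machine loses power before the OS has flushed its cache, which is why forcing a sync on every write costs throughput.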


Ques: 16). What is backpressure in NiFi?


Sometimes the producing system is faster than the consuming system, so messages are consumed more slowly than they are produced. All unprocessed messages (FlowFiles) then accumulate in the connection's queue. You can, however, set a backpressure threshold on the connection, based either on the number of FlowFiles or on the total size of the queued data. If the queue exceeds that limit, the connection applies backpressure to the producing processor, which stops running. As a result, no new FlowFiles are generated until the backpressure is relieved.
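The mechanism can be modelled as a bounded queue with two thresholds. This is a simplified sketch rather than NiFi's implementation: the `Connection` class and its method names are hypothetical, and in NiFi the upstream processor is simply not scheduled while backpressure is engaged, which the model approximates by refusing new FlowFiles.

```python
from collections import deque


class Connection:
    """Toy model of a NiFi connection with object-count and size thresholds."""

    def __init__(self, max_objects: int, max_bytes: int):
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self.queue = deque()
        self.queued_bytes = 0

    def backpressure_engaged(self) -> bool:
        """True once either threshold has been reached."""
        return (len(self.queue) >= self.max_objects
                or self.queued_bytes >= self.max_bytes)

    def offer(self, flowfile: bytes) -> bool:
        """Producer side: enqueue only while backpressure is not engaged."""
        if self.backpressure_engaged():
            return False  # the producer has to wait
        self.queue.append(flowfile)
        self.queued_bytes += len(flowfile)
        return True

    def poll(self) -> bytes:
        """Consumer side: taking a FlowFile frees room for the producer."""
        flowfile = self.queue.popleft()
        self.queued_bytes -= len(flowfile)
        return flowfile


connection = Connection(max_objects=2, max_bytes=1024)
accepted = [connection.offer(b"a"), connection.offer(b"b"), connection.offer(b"c")]
connection.poll()                      # the consumer catches up
accepted_after_poll = connection.offer(b"c")
```

The third offer is refused because the object threshold of two has been reached; once the consumer takes one FlowFile off the queue, the producer can continue.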


Ques: 17). What is a bulletin in NiFi, and how does it help?


If you want to know whether a dataflow has any issues, you can look through the logs for anything of interest, but having notifications appear on screen is far more convenient. If a Processor logs anything at WARNING or ERROR level, a "Bulletin Indicator" appears in the top-right corner of the Processor.

This indicator, which looks like a sticky note, is shown for five minutes after the event occurs. By hovering over the bulletin, the user can see what happened without having to search through log messages. In a cluster, the bulletin also indicates which node emitted it. We can also change the log level at which bulletins are generated in the Settings tab of the Processor's Configure dialog.




Ques: 19). What prioritisation scheme is utilised if no prioritizers are set in a processor?


The default prioritisation scheme is described as "undefined", and it is subject to change. If no prioritizers are set, the processor orders the data based on the FlowFile's Content Claim. This gives the most efficient reading of the data and the highest throughput. Changing the default to First In First Out has been discussed, but for now the default is based on what performs best.


Ques: 20). If no prioritizers are set in a processor, which prioritisation scheme is used?


The default prioritisation scheme is said to be undefined, and it may change from time to time. If no prioritizers are set, the processor sorts the data based on the FlowFile's Content Claim. This provides the most efficient reading of the data and the highest throughput. Changing the default to First In First Out has been discussed, but right now it is based on what gives the best performance.

These are some of the most commonly asked interview questions about Apache NiFi. To read more about Apache NiFi, you can check the Apache NiFi category and subscribe to the newsletter for more related articles.

Top 20 Apache Spark Interview Questions & Answers


Ques: 1). What is Apache Spark?


Apache Spark is an open-source cluster computing framework for real-time processing. It has a vibrant open-source community and is one of the most active Apache projects. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Spark is one of the Apache Software Foundation's most successful projects and has established itself as a leading engine for Big Data processing. Many enterprises run Spark on clusters with thousands of nodes, including large companies such as Amazon, eBay, and Yahoo!


Ques: 2). What advantages does Spark have over MapReduce?


Compared to MapReduce, Spark has the following advantages:

Spark can process data 10 to 100 times faster than Hadoop MapReduce thanks to in-memory processing, whereas MapReduce relies on persistent storage for its data processing tasks.

Unlike Hadoop, Spark has built-in libraries that allow it to perform a variety of tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop MapReduce, on the other hand, only supports batch processing.

Hadoop is heavily reliant on disk, whereas Spark encourages caching and in-memory data storage. Spark can perform computations multiple times on the same dataset, which is called iterative computation; Hadoop MapReduce has no built-in support for iterative computing.


Ques: 3). What exactly is YARN?


As in Hadoop, YARN can serve as the cluster resource manager for Spark, providing a central resource management platform for delivering scalable operations across the cluster. YARN, like Mesos, is a distributed container manager, whereas Spark is a data processing tool. Spark can run on YARN in the same way that Hadoop MapReduce can. Running Spark on YARN requires a Spark binary distribution built with YARN support.


Ques: 4). What's the difference between an RDD, a Dataframe, and a Dataset?


Resilient Distributed Dataset (RDD) - The RDD is the most basic data structure in Spark: an immutable collection of records partitioned among the cluster nodes. It allows us to perform fault-tolerant in-memory computations on large clusters.

Unlike a DataFrame or a Dataset, an RDD does not store a schema; it stores only the data. If a user wants to apply a schema to an RDD, they must first define a case class and then apply it to the data.

We will use RDD for the below cases:

-When our data is unstructured, such as text streams or media streams.

-When we don’t want to implement any schema.

-When we don’t care about the column name attributes while processing or accessing.

-When we want to manipulate the data with functional programming constructs rather than domain-specific expressions.

-When we want low-level transformation, actions and control on the dataset.


DataFrame:

-Like RDDs, DataFrames are immutable collections of data.

-Unlike an RDD, a DataFrame has a schema for its data, which makes it easy to access and process large datasets distributed among the nodes of the cluster.

-DataFrame provides a domain-specific language API to manipulate distributed data and makes Spark accessible to a wider audience, beyond specialized data engineers.

-From Spark 2.x, a DataFrame is simply an alias for Dataset[Row] (the untyped API). Consider a DataFrame as a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object.


Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java. You can apply a case class to an RDD and use the result as a Dataset[T].


Ques: 5). Can you explain how you can use Apache Spark along with Hadoop?


Apache Spark has the advantage of being compatible with Hadoop, and the two make a powerful combination. Using them together pairs Spark's processing capability with the best features of HDFS and YARN. The following are some examples of how to use Hadoop components with Apache Spark:

Batch & Real-Time Processing – MapReduce and Spark can work together, where the former handles batch processing, and the latter handles real-time processing.

HDFS – Spark can make use of the HDFS to leverage the distributed replicated storage.

MapReduce – Apache Spark can be used alongside MapReduce in the same Hadoop cluster, or independently as a processing framework.

YARN – You can run Spark applications on YARN.


Ques: 6). What is the meaning of the term "spark lineage"?


• In Spark, all dependencies between RDDs are recorded in a graph, independently of the actual data. This graph is called the lineage graph.

• RDD stands for Resilient Distributed Dataset, where "resilient" refers to fault tolerance. Using the RDD lineage graph, Spark can recompute a partition that is missing or damaged because of a node failure. When we create new RDDs from existing RDDs, Spark uses the lineage graph to track the dependencies. Each RDD keeps a pointer to one or more parents, together with metadata describing the type of relationship it has with the parent RDD.

• The RDD lineage graph in Spark can be obtained using the toDebugString method.
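The idea of parent pointers and recomputation can be sketched in a few lines of plain Python. This is a conceptual model only, not Spark's API: `SimpleRDD`, `compute_partition`, and `lineage` are hypothetical names, and the real lineage graph also tracks wide dependencies such as shuffles.

```python
class SimpleRDD:
    """Toy RDD: holds either source partitions or a pointer to a parent."""

    def __init__(self, parent=None, fn=None, source=None, name="source"):
        self.parent = parent  # pointer to the parent RDD (None for a source)
        self.fn = fn          # how to derive a partition from the parent's
        self.source = source  # list of partitions, only set for a source RDD
        self.name = name

    def map(self, fn):
        """Record a new RDD that depends on this one; nothing is computed."""
        return SimpleRDD(parent=self,
                         fn=lambda part: [fn(x) for x in part],
                         name="map")

    def filter(self, predicate):
        return SimpleRDD(parent=self,
                         fn=lambda part: [x for x in part if predicate(x)],
                         name="filter")

    def compute_partition(self, index):
        """Rebuild a single partition by walking the lineage back to the source."""
        if self.parent is None:
            return list(self.source[index])
        return self.fn(self.parent.compute_partition(index))

    def lineage(self):
        """List the chain of dependencies, newest first."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


base = SimpleRDD(source=[[1, 2], [3, 4]])
result = base.map(lambda x: x * 2).filter(lambda x: x > 2)
# Suppose the node holding partition 1 failed: only that partition is rebuilt.
recovered = result.compute_partition(1)
```

Because each derived dataset stores how it was produced rather than the produced data, losing a partition only requires replaying the recorded steps for that partition.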


Ques: 7). List the various components of the Spark Ecosystem.


The Spark ecosystem has five main components:

GraphX: Enables graphs and graph-parallel computation.

MLlib: It is used for machine learning.

Spark Core: A powerful parallel and distributed processing platform.

Spark Streaming: Handles real-time streaming data.

Spark SQL: Combines Spark's functional programming API with relational processing.


Ques: 8). What is RDD in Spark? Write about it and explain it.


RDD stands for Resilient Distributed Dataset. RDDs are Spark's core data structure; they are fault-tolerant and immutable. They are partitioned datasets distributed across the cluster nodes.

There are two ways to create RDDs: parallelizing an existing collection and referencing an external dataset. RDDs are lazily evaluated, and this lazy evaluation contributes to Spark's faster processing.


Ques: 9). In Spark, how does streaming work?


Spark receives data in real time and divides it into batches. The Spark engine processes these batches, and the final stream of results is also returned in batches. The DStream, or Discretized Stream, is the most basic stream abstraction in Spark Streaming.
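The batching step can be illustrated without Spark. The sketch below groups timestamped events into fixed-interval batches and applies a function to each batch. It is a conceptual model of micro-batching, not the DStream API; `discretize` and `process_stream` are hypothetical helper names, and intervals with no events are simply skipped here.

```python
def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) pairs into batches of fixed duration."""
    batches = {}
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches.setdefault(batch_index, []).append(value)
    return [batches[index] for index in sorted(batches)]


def process_stream(events, batch_interval, batch_fn):
    """Apply batch_fn to every batch; the results also come out batch by batch."""
    return [batch_fn(batch) for batch in discretize(events, batch_interval)]


events = [(0.1, 1), (0.9, 2), (1.2, 3), (2.5, 4), (2.7, 5)]
batches = discretize(events, batch_interval=1.0)
totals = process_stream(events, batch_interval=1.0, batch_fn=sum)
```

With a one-second interval, the five events fall into three batches, and each batch is processed as a small, ordinary dataset.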


Ques: 10). Is it feasible to access and analyse data stored in Cassandra databases using Apache Spark?


Yes. Apache Spark can access and analyse data stored in Cassandra databases using the Spark Cassandra Connector. The connector lets Spark executors communicate with local Cassandra nodes and request only local data.

Cassandra and Apache Spark can be connected to speed up queries by lowering network traffic between Spark executors and Cassandra nodes.


Ques: 11). What are the advantages of using Spark SQL?


Spark SQL carries out the following tasks:

Loads data from a variety of structured datasources, such as a relational database management system (RDBMS).

It can query data using SQL statements, both from within a Spark program and from third-party tools such as Tableau that connect through JDBC/ODBC.

It can also provide SQL and Python/Scala code interaction.


Ques: 12). What is the purpose of Spark Executor?


When a SparkContext is created, it acquires executors on the worker nodes of the cluster. Spark executors are responsible for running computations and storing data on the worker node, and for returning results to the driver.


Ques: 13). What are the advantages and disadvantages of Spark?


Advantages: Spark is known for real-time data processing, which may be employed in applications such as stock market analysis, finance, and telecommunications.

Spark's stream processing allows for real-time data analysis, which can aid in fraud detection, system alarms, and other applications.

Due to its lazy evaluation mechanism and parallel processing, Spark can process data 10 to 100 times faster than MapReduce.

Disadvantages: Compared to Hadoop, Spark consumes more storage space.

Work has to be distributed across multiple nodes of a cluster rather than taking place on a single node.

Spark's in-memory processing might be costly when dealing with large amounts of data.

When compared to Hadoop, Spark makes better use of data.


Ques: 14). What are some of the drawbacks of utilising Apache Spark?


The following are some of the drawbacks of utilising Apache Spark:

There is no built-in file management system. To take advantage of one, integration with other platforms such as Hadoop is required.

Higher latency and, as a result, lower throughput.

It does not support true real-time processing of data streams. In Apache Spark, live data streams are partitioned into batches, which are processed and then returned as batches of results. In other words, Spark Streaming is micro-batch processing rather than true real-time processing.

There are fewer algorithms available.

Record-based window criteria are not supported by Spark Streaming. Work also has to be distributed across multiple nodes of a cluster instead of running everything on a single node.

Apache Spark's in-memory ability becomes a bottleneck when used for the cost-efficient processing of big data.


Ques: 15). Is Apache Spark compatible with Apache Mesos?


Yes. Spark can run on clusters managed by Apache Mesos, just as it runs on clusters managed by YARN. Spark can also run in standalone mode, using its own built-in cluster manager, without an external resource manager. Alternatively, it can use YARN or Mesos to run across multiple nodes.


Ques: 16). What are broadcast variables, and how do they work?


Accumulators and broadcast variables are the two types of shared variables in Spark. Broadcast variables are read-only variables cached on the executors for local reference, rather than being shipped back and forth from the driver. A broadcast variable keeps a read-only cached copy of a variable on each machine instead of sending a copy of it with every task.

Additionally, broadcast variables are utilised to distribute a copy of a big input dataset to each node. To cut transmission costs, Apache Spark distributes broadcast variables using efficient broadcast algorithms.

There is no need to replicate variables for each task when using broadcast variables. As a result, data can be processed quickly. In contrast to RDD lookup(), broadcast variables assist in storing a lookup table inside the memory, enhancing retrieval efficiency.


Ques: 17). In Apache Spark, how does caching work?


Caching RDDs in Spark speeds up processing when the same RDD is accessed multiple times. In Spark Streaming, Discretized Streams (DStreams) likewise allow users to cache or persist stream data in memory.

The functions cache() and persist(level) are used, respectively, to cache data in memory and to cache data at a specified storage level.

Calling persist() without a level is the same as calling cache(): the data is cached in memory. The persist(level) method caches data at the given storage level, such as on disk, in memory, or in off-heap memory.
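The effect of caching can be shown with a small model that counts how often a dataset is recomputed. This is an illustration in plain Python, not Spark code: `CachedDataset` and `compute_count` are hypothetical, and only an in-memory level is modelled.

```python
class CachedDataset:
    """Toy dataset that is recomputed on every action unless it was cached."""

    def __init__(self, compute):
        self._compute = compute  # function that produces the data
        self._persisted = False
        self._cached = None
        self.compute_count = 0   # how many times the data was actually computed

    def cache(self):
        """Mark the dataset for caching; the data is stored on the first action."""
        self._persisted = True
        return self

    def collect(self):
        """Action: return the data, reusing the cached copy when there is one."""
        if self._persisted and self._cached is not None:
            return self._cached
        self.compute_count += 1
        result = self._compute()
        if self._persisted:
            self._cached = result
        return result


uncached = CachedDataset(lambda: [x * x for x in range(4)])
uncached.collect()
uncached.collect()   # computed a second time

cached = CachedDataset(lambda: [x * x for x in range(4)]).cache()
cached.collect()
squares = cached.collect()   # served from the cached copy
```

The uncached dataset is computed once per action, while the cached one is computed only on the first action; this is the saving caching brings when a dataset is reused.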


Ques: 18). What exactly is Akka? What does Spark do with it?


Akka is a Scala and Java framework for building reactive, distributed, parallel, and resilient concurrent applications. Earlier versions of Apache Spark were built on Akka; since Spark 2.0, Spark no longer depends on it and uses its own RPC layer instead.

In those earlier versions, Spark used Akka for job scheduling and for messaging between the master and the worker nodes when assigning tasks.


Ques: 19). What applications do you utilise Spark streaming for?


Spark Streaming is used when real-time data must be streamed into a Spark program. The data can be ingested from a variety of sources, such as Kafka, Flume, and Amazon Kinesis. For processing, the streamed data is divided into batches.

Spark Streaming is used for real-time sentiment analysis of customers on social media sites such as Twitter and Facebook.

Live streaming data processing is critical for detecting outages, detecting fraud in financial institutions, and making stock market predictions, among other things.


Ques: 20). What exactly do you mean when you say "lazy evaluation"?


Spark is intelligent about how it works with data. When you ask Spark to perform an operation on a dataset, it records your instructions so that it doesn't forget them, but it does nothing until you ask for a result. When map() is invoked on an RDD, the operation is not performed immediately. Spark does not evaluate transformations until an action is invoked. This helps optimise the overall data processing workflow.
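A small Python model makes the difference between recording a transformation and executing it visible. It is a sketch of the idea only, not Spark's implementation: `LazyDataset` and the `traced` function are hypothetical, with `collect()` playing the role of an action.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on collect()."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)  # recorded ("map" | "filter", function) steps

    def map(self, fn):
        """Transformation: remember fn, do no work yet."""
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, predicate):
        """Transformation: remember the predicate, do no work yet."""
        return LazyDataset(self._data, self._ops + (("filter", predicate),))

    def collect(self):
        """Action: run every recorded step over the data and return the result."""
        output = []
        for item in self._data:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                output.append(item)
        return output


calls = []


def traced(x):
    """Multiply by ten and record that the function actually ran."""
    calls.append(x)
    return x * 10


dataset = LazyDataset([1, 2, 3]).map(traced).filter(lambda x: x > 10)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = dataset.collect()         # the action triggers the recorded steps
```

Because the whole chain is known before anything runs, the engine is free to combine steps, as the single pass over the data in `collect()` hints at.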