
November 17, 2021

Top 20 Apache Spark Interview Questions & Answers

  

Ques: 1). What is Apache Spark?

Answer:

Apache Spark is an open-source cluster-computing framework for real-time data processing. It has a vibrant open-source community and is currently the most active Apache project. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Spark is one of the Apache Software Foundation's most successful projects and has established itself as a leader in Big Data processing. Many enterprises run Spark on clusters with thousands of nodes, and it is used by large companies such as Amazon, eBay, and Yahoo!




Ques: 2). What advantages does Spark have over MapReduce?

Answer:

Compared to MapReduce, Spark has the following advantages:

Spark processes data 10 to 100 times faster than Hadoop MapReduce because it performs processing in memory, whereas MapReduce relies on persistent storage for every data-processing step.

Unlike Hadoop, Spark has built-in libraries that let it perform a variety of workloads from the same core, including batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, on the other hand, only supports batch processing.

Hadoop is heavily reliant on disk, whereas Spark encourages caching and keeping data in memory. Spark can run computations over the same dataset repeatedly; this is called iterative computation, and Hadoop MapReduce does not implement it.


Ques: 3). What exactly is YARN?

Answer:

YARN (Yet Another Resource Negotiator) is Hadoop's central resource-management platform, responsible for delivering scalable operations across the cluster. Like Mesos, YARN is a distributed container manager, whereas Spark is a data-processing engine. Spark can run on YARN just as Hadoop MapReduce can; running Spark on YARN requires a Spark binary distribution built with YARN support.


Ques: 4). What's the difference between an RDD, a DataFrame, and a Dataset?

Answer:

Resilient Distributed Dataset (RDD): the most basic data structure in Spark, an immutable collection of records partitioned across the nodes of the cluster. It allows us to perform fault-tolerant, in-memory computations on large clusters.

Unlike DataFrames and Datasets, an RDD does not keep a schema; it only stores the data. To apply a schema to an RDD, the user must first define a case class and then map the data onto it.

RDD is a good fit in the following cases:

-When the data is unstructured, such as streams of text or media.

-When we don't want to impose any schema.

-When we don't care about column names or attributes while processing or accessing the data.

-When we want to manipulate the data with functional programming constructs rather than domain-specific expressions.

-When we need low-level transformations, actions, and control over the dataset.

DataFrame:

-Like RDDs, DataFrames are immutable collections of data.

-Unlike RDDs, DataFrames carry a schema for their data, making it easy to access and process large datasets distributed across the nodes of the cluster.

-DataFrames provide a domain-specific language API to manipulate distributed data, making Spark accessible to a wider audience beyond specialised data engineers.

-From Spark 2.x onward, a DataFrame is simply an alias for Dataset[Row] (the untyped API): a collection of generic objects where each Row is a generic, untyped JVM object.

DataSet:

A Dataset, by contrast, is a collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java. An RDD of such objects can be converted into and used as a Dataset[T].
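
A minimal Scala sketch of the three abstractions, assuming a spark-shell-style environment; the SparkSession name spark and the Person case class are illustrative, not from the original answer:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("AbstractionsSketch").master("local[*]").getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    // RDD: no schema, just distributed records
    val rdd = sc.parallelize(Seq(Person("Asha", 31), Person("Ravi", 28)))

    // DataFrame: Dataset[Row] -- the schema is known, but each Row is untyped
    val df = rdd.toDF()
    df.select("name").show()

    // Dataset[Person]: strongly typed and checked at compile time
    val ds = rdd.toDS()
    ds.filter(_.age > 30).show()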


Ques: 5). Can you explain how you can use Apache Spark along with Hadoop?

Answer:

Apache Spark has the advantage of being Hadoop compatible, and together they make a powerful combination. Using Apache Spark with Hadoop pairs Spark's processing capability with the strengths of HDFS and YARN. The following are some ways to use Hadoop components with Apache Spark:

Batch & Real-Time Processing – MapReduce and Spark can work together, where the former handles batch processing, and the latter handles real-time processing.

HDFS – Spark can make use of the HDFS to leverage the distributed replicated storage.

MapReduce – Apache Spark can be used in conjunction with MapReduce in the same Hadoop cluster or independently as a processing framework.

YARN – You can run Spark applications on YARN (see the sketch after this list).
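
A minimal sketch of the HDFS and YARN integration points, assuming an existing SparkContext named sc; the namenode address, file path, class name, and jar name are placeholders, not taken from the original answer:

    // Submitting a Spark application to a YARN-managed cluster (shell command shown as a comment):
    //   spark-submit --master yarn --deploy-mode cluster --class com.example.LogCount app.jar

    // Inside the application, Spark reads directly from HDFS:
    val events = sc.textFile("hdfs://namenode:8020/data/events/*.log")
    println(events.filter(_.contains("ERROR")).count())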


Ques: 6). What is the meaning of the term "spark lineage"?

Answer:

• In Spark, all dependencies between RDDs are recorded in a graph, independent of the actual data. This is referred to as the lineage graph.

• RDD stands for Resilient Distributed Dataset, with "resilient" referring to fault tolerance. Using the RDD lineage graph, we can re-compute a missing or damaged partition caused by a node failure. When we create new RDDs from existing ones, the lineage graph records those dependencies: each RDD keeps a pointer to one or more parents, together with metadata describing the type of relationship it has with the parent RDD.

• The RDD lineage graph in Spark can be obtained using the toDebugString method.
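
A small sketch showing how toDebugString exposes the lineage graph; the HDFS path is a placeholder and sc is assumed to be an existing SparkContext (e.g. in spark-shell):

    val lines  = sc.textFile("hdfs:///data/books/*.txt")   // hypothetical input path
    val words  = lines.flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Prints the chain of dependencies: reduceByKey <- map <- flatMap <- textFile
    println(counts.toDebugString)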

 

Ques: 7). List the various components of the Spark Ecosystem.

Answer:

These are the five types of components in the Spark Ecosystem:

GraphX: Enables graphs and graph-parallel computation.

MLlib: Spark's machine learning library.

Spark Core: A powerful parallel and distributed processing platform.

Spark Streaming: Handles real-time streaming data.

Spark SQL: Combines Spark's functional programming API with relational processing.

 

Ques: 8). What is an RDD in Spark? Describe and explain it.

Answer:

RDD stands for Resilient Distributed Dataset. RDDs are Spark's core data structure: immutable, fault-tolerant collections of data partitioned across the nodes of the cluster.

There are two ways to construct an RDD: by parallelising an in-memory collection or by referencing an external dataset. RDDs are evaluated lazily, and this lazy evaluation is one reason for Spark's fast processing performance.
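
A minimal sketch of both creation methods, assuming an existing SparkContext named sc; the file path is a placeholder:

    // 1) Parallelising an in-memory collection
    val nums = sc.parallelize(1 to 100)

    // 2) Referencing an external dataset
    val lines = sc.textFile("hdfs:///data/input.txt")

    // Nothing is computed until an action runs (lazy evaluation)
    println(nums.sum())
    println(lines.count())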

 

Ques: 9). In Spark, how does streaming work?

Answer:

Spark ingests data in real time and divides it into batches. The Spark engine processes these batches, and the final stream of results is likewise produced in batches. The basic stream abstraction in Spark Streaming is the DStream, or Discretized Stream.
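
A minimal DStream sketch: a word count over a socket source, assuming data arrives on localhost:9999 and a 5-second batch interval (both are illustrative choices, not from the original answer):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))      // each 5-second batch becomes an RDD

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.print()                                          // results come back batch by batch

    ssc.start()
    ssc.awaitTermination()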

 

Ques: 10). Is it feasible to access and analyse data stored in Cassandra databases using Apache Spark?

Answer: 

Yes, Apache Spark can be used to retrieve and analyse data stored in Cassandra databases, using the Spark Cassandra Connector. The connector allows Spark executors to communicate with local Cassandra nodes and request only local data.

Connecting Cassandra and Apache Spark in this way speeds up queries by reducing network traffic between Spark executors and Cassandra nodes.
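
A sketch of reading a Cassandra table through the Spark Cassandra Connector; the connection host, keyspace (ks), and table (users) are placeholders, and the connector library is assumed to be on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CassandraSketch")
      .config("spark.cassandra.connection.host", "127.0.0.1")   // assumed local Cassandra node
      .getOrCreate()
    import spark.implicits._

    // Load a Cassandra table as a DataFrame via the connector's data source
    val users = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "users"))
      .load()

    users.filter($"age" > 30).show()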

 

Ques: 11). What are the advantages of using Spark SQL?

Answer:

Spark SQL carries out the following tasks:

It loads data from a variety of structured data sources, such as relational database management systems (RDBMS).

It lets you query data using SQL statements, both inside a Spark program and through JDBC/ODBC connectors from third-party tools such as Tableau.

It also provides interoperability between SQL and Python/Scala code.
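
A short sketch of both access paths: loading a table over JDBC and then querying it with SQL from the same program. The JDBC URL, table name, and credentials are placeholders, not from the original answer:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()

    // Load a structured source (here an RDBMS table over JDBC)
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "report")
      .option("password", "secret")
      .load()

    // Query it with SQL inside the Spark program
    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id").show()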

 

Ques: 12). What is the purpose of Spark Executor?

Answer:

Executors are launched on the worker nodes of the cluster when a SparkContext is created. Spark executors are in charge of performing computations and storing data on the worker nodes, and of returning the results to the driver.

 

Ques: 13). What are the advantages and disadvantages of Spark?

Answer:

Advantages: Spark is well suited to real-time data processing, which is used in applications such as stock market analysis, finance, and telecommunications.

Spark's stream processing allows for real-time data analysis, which can aid in fraud detection, system alarms, and other applications.

Due to its lazy evaluation mechanism and parallel processing, Spark processes data 10 to 100 times faster.

Disadvantages: Compared to Hadoop, Spark consumes more storage space.

Work is distributed across multiple nodes rather than taking place on a single node.

Spark's in-memory processing can be expensive when dealing with large amounts of data.

Compared with Spark, Hadoop MapReduce makes better use of data locality on disk.

 

Ques: 14). What are some of the drawbacks of utilising Apache Spark?

Answer:

The following are some of the drawbacks of utilising Apache Spark:

There is no built-in file management system. To take advantage of one, Spark must be integrated with another platform such as Hadoop.

Higher latency, and lower throughput as a result.

It does not support true real-time stream processing. In Apache Spark, live data streams are partitioned into batches, which are then processed and emitted again as batches. In other words, Spark Streaming is micro-batch processing rather than true real-time stream processing.

There are fewer algorithms available.

Spark Streaming does not support record-based window criteria. Work must also be distributed across multiple nodes instead of running everything on a single node.

Apache Spark's reliance on in-memory processing becomes a bottleneck for cost-efficient processing of big data.

 

Ques: 15). Is Apache Spark compatible with Apache Mesos?

Answer:

Yes. Spark can run on Apache Mesos-managed clusters, just as it runs on YARN-managed clusters. In standalone mode, Spark runs without a resource manager; when it needs to execute across multiple nodes, it can use YARN or Mesos.

 

Ques: 16). What are broadcast variables, and how do they work?

Answer:

Spark has two types of shared variables: accumulators and broadcast variables. Broadcast variables are read-only variables cached on the executors for local reference, rather than being shipped back and forth to the driver. Instead of delivering a copy of a variable with every task, a broadcast variable keeps a read-only cached copy on each machine.

Additionally, broadcast variables are used to distribute a copy of a large input dataset to every node. Apache Spark distributes broadcast variables using efficient broadcast algorithms to cut communication costs.

With broadcast variables there is no need to ship the variable with every task, so data can be processed faster. In contrast to RDD lookup(), broadcast variables help keep a lookup table in memory, improving retrieval efficiency.
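
A minimal sketch of a broadcast lookup table, assuming an existing SparkContext named sc; the country codes are made-up sample data:

    // Ship a small lookup table to every executor once, instead of with every task
    val countryNames = Map("IN" -> "India", "US" -> "United States")
    val bcCountries  = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val named = codes.map(code => bcCountries.value.getOrElse(code, "unknown"))
    println(named.collect().mkString(", "))   // India, United States, India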

 

Ques: 17). In Apache Spark, how does caching work?

Answer:

Caching RDDs in Spark speeds up processing by allowing repeated access to the same RDD. In Spark Streaming, DStreams (Discretized Streams) likewise allow users to cache or persist the stream's data in memory.

The cache() function caches data in memory, while persist(level) caches it according to the specified storage level.

Calling persist() without a level specifier is the same as cache(): the data is cached in memory. The persist(level) method caches data at the given storage level, for example on disk, in memory, or in off-heap memory.
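
A short sketch of cache() versus persist(level), assuming an existing SparkContext named sc; the HDFS paths are placeholders:

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs:///data/app-logs/*.log")
    logs.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)

    val events = sc.textFile("hdfs:///data/events/*.json")
    events.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that don't fit in memory to disk

    logs.filter(_.contains("ERROR")).count()       // the first action materialises and caches 'logs'
    logs.filter(_.contains("WARN")).count()        // later actions reuse the cached partitions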

 

Ques: 18).  What exactly is Akka? What does Spark do with it?

Answer:

Akka is a Scala and Java toolkit for building reactive, distributed, parallel, and resilient concurrent applications. Earlier versions of Apache Spark were built on top of Akka.

When assigning tasks to worker nodes, Spark used Akka for job scheduling and for messaging between the master and the worker nodes.

 

Ques: 19). What applications do you utilise Spark streaming for?

Answer:

Spark Streaming is used when real-time data must be streamed into a Spark program. The data can be streamed from a variety of sources, such as Kafka, Flume, and Amazon Kinesis, and is divided into batches for processing.

Spark Streaming is used to perform real-time sentiment analysis of customers on social media sites such as Twitter and Facebook.

Processing live streaming data is also critical for detecting outages, detecting fraud in financial institutions, and making stock market predictions, among other things.

 

Ques: 20). What exactly do you mean when you say "lazy evaluation"?

Answer:

Spark is deliberately lazy in how it works with data. When you ask Spark to perform an operation on a dataset, it records the instruction rather than executing it right away: when map() is invoked on an RDD, the operation is not performed immediately. Spark does not evaluate transformations until an action needs their result, which helps optimise the overall data-processing workflow.
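
A tiny illustration, assuming an existing SparkContext named sc: the transformations only record lineage, and the work happens when the action runs:

    val numbers = sc.parallelize(1 to 1000000)

    // No computation happens here -- map() and filter() only record the lineage
    val squares = numbers.map(n => n.toLong * n)
    val large   = squares.filter(_ > 1000000L)

    // The action triggers evaluation of the whole chain
    println(large.count())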




June 05, 2019

Top 20 Hadoop Technical Interview Questions & Answers



Ques: 1. What is the need of Hadoop?

Ans: A large amount of unstructured data is dumped into our machines every single day. The major challenge is not storing large data sets in our systems, but retrieving and analysing big data that is spread across different machines at different locations.
This is where Hadoop comes in. Hadoop can analyse data present on different machines at different locations very quickly and in a very cost-effective way. It uses the MapReduce model, which divides a query into small parts and processes them in parallel. This is also known as parallel computing.


Ques: 2. What is the Hadoop Framework?

Ans: It is an open-source framework written in Java by the Apache Software Foundation. The framework is used to write software applications that process vast amounts of data; Hadoop can handle multiple terabytes of data. It works in parallel on large clusters, which can have thousands of computers (nodes), and it processes data in a very reliable and fault-tolerant manner.


Ques: 3. What are the various basic characteristics of Hadoop?

Ans: The Hadoop framework is capable of solving problems involving Big Data analysis. It is written in Java. Its programming model is based on Google's MapReduce, and its infrastructure is based on Google's distributed file system. Hadoop is scalable: more nodes can be added to it as needed.


Ques: 4. What are the core components of Hadoop?

Ans: HDFS and MapReduce are the core components of Hadoop. Hadoop Distributed File System (HDFS) is basically used to store large data sets and MapReduce is used to process such large data sets.


Ques: 5. What do you understand by streaming access?

Ans: HDFS works on the principle of "Write Once, Read Many". This streaming-access pattern is extremely important in HDFS: reading the complete data set quickly matters more than the time taken to fetch a single record. HDFS focuses less on how the data is stored and more on how to retrieve it at the fastest possible speed, especially while analysing logs.


Ques: 6. What do you understand by a task tracker?

Ans: Task Trackers manage the execution of individual tasks on the slave nodes. The Task Tracker is a daemon that runs on the DataNodes. When a client submits a job, the JobTracker initialises the job, divides the work, and assigns it to different Task Trackers to perform the MapReduce tasks.
While performing these tasks, each Task Tracker continuously communicates with the JobTracker by sending heartbeat messages. If the JobTracker does not receive a heartbeat from a Task Tracker within the specified time, it assumes that the Task Tracker has crashed and assigns its task to another Task Tracker in the cluster.


Ques: 7. If a particular file is 40 MB, will the HDFS block still consume 64 MB, the default size?

Ans: No. 64 MB is just the unit in which the data is stored. In this situation, only 40 MB will be consumed by the HDFS block, and the remaining 24 MB is free to store something else. It is the master node (NameNode) that allocates data in an efficient manner.


Ques: 8. What is a Rack in Hadoop?

Ans: A rack is a storage area with all its DataNodes put together: a physical collection of DataNodes stored at a single location. There can be multiple racks in a single location, and the DataNodes of a cluster can be physically located at different places.


Ques: 9. How will the data be stored on a Rack?

Ans: Whenever a client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the NameNode and gets three DataNodes for every block of the file, which indicate where each block should be stored. While placing the blocks on DataNodes, the key rule followed is: "for every block of data, two copies will exist in one rack and the third copy in a different rack". This rule is known as the Replica Placement Policy.


Ques: 10. Can you explain the input and output data format of the Hadoop Framework?

Ans: The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The flow looks like: (input) <k1, v1> -> map -> <k2, v2> -> combine/sort -> <k2, v2> -> reduce -> <k3, v3> (output)


Ques: 11. How can you use the Reducer?

Ans: The Reducer reduces a set of intermediate values that share a key to a (usually smaller) set of values. The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).


Ques: 12. How can you explain the core methods of the Reducer?

Ans: The API of Reducer is very similar to that of Mapper: there is a run() method that receives a Context containing the job's configuration, as well as interfacing methods that return data from the reducer back to the framework. The run() method calls setup() once, reduce() once for each key associated with the reduce task, and cleanup() once at the end. Each of these methods can access the job's configuration data by using Context.getConfiguration().
The reduce() method is the heart of any Reducer. It is called once per key; the second argument is an Iterable that returns all the values associated with that key.
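
A sketch of these core methods in Scala against the new (org.apache.hadoop.mapreduce) API; the SumReducer name and the word-count summing logic are illustrative, not part of the original answer:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Reducer

    // Type parameters are (KEYIN, VALUEIN, KEYOUT, VALUEOUT)
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      type Ctx = Reducer[Text, IntWritable, Text, IntWritable]#Context

      override def setup(context: Ctx): Unit = {
        // Runs once before any reduce() call; the job configuration is available via context.getConfiguration
      }

      override def reduce(key: Text, values: java.lang.Iterable[IntWritable], context: Ctx): Unit = {
        // Called once per key; the Iterable yields every value associated with that key
        var sum = 0
        val it  = values.iterator()
        while (it.hasNext) sum += it.next().get()
        context.write(key, new IntWritable(sum))
      }

      override def cleanup(context: Ctx): Unit = {
        // Runs once after the last key has been processed
      }
    }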


Ques: 13. How does the JobTracker schedule a task?

Ans: The TaskTrackers send heartbeat messages to the JobTracker at regular intervals to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if none is available, it looks for an empty slot on a machine in the same rack.


Ques: 14. How many Daemon processes run on a Hadoop cluster?

Ans: Five daemons run on a Hadoop cluster, and each runs in its own JVM.
The NameNode, Secondary NameNode, and JobTracker daemons run on master nodes; the DataNode and TaskTracker daemons run on each slave node.
  • NameNode: Stores and maintains the metadata for HDFS.
  • Secondary NameNode: Performs housekeeping functions for the NameNode.
  • JobTracker: Manages MapReduce jobs and distributes individual tasks to the machines running the TaskTracker.
  • DataNode: Stores the actual HDFS data blocks.
  • TaskTracker: Responsible for instantiating and monitoring individual Map and Reduce tasks.


Ques: 15. What is Hadoop Distributed File System (HDFS)? How it is different from Traditional File Systems?

Ans: The Hadoop Distributed File System (HDFS) is responsible for storing huge amounts of data on the cluster. It is a distributed file system designed to run on commodity hardware.
It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
  • HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
  • HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
  • HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

Ques: 16. What are the IdentityMapper and IdentityReducer in MapReduce?

Ans:
  • org.apache.hadoop.mapred.lib.IdentityMapper: Implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
  • org.apache.hadoop.mapred.lib.IdentityReducer: Performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default.

Ques: 17. What do you mean by commodity hardware? How can Hadoop work on them?

Ans: Average, inexpensive systems are known as commodity hardware, and Hadoop can be installed on any of them. Hadoop does not require high-end hardware to function.


Ques: 18. Which one is the master node in HDFS? Can it be commodity hardware?

Ans: The NameNode is the master node in HDFS, and the JobTracker runs on it. This node contains the metadata, works as a high-availability machine, and is a single point of failure in HDFS. It cannot be commodity hardware, because the entire HDFS depends on it.


Ques: 19. What is the main difference between Mapper and Reducer?

Ans:
  • The map() method is called separately for each key/value pair to be processed. It processes input key/value pairs and emits intermediate key/value pairs.
  • The reduce() method is called separately for each key and its associated list of values. It processes intermediate key/value pairs and emits final key/value pairs.
  • Both are initialised and called before any other method is called; both take no parameters and produce no output.

Ques: 20. What is the difference between a map-side join and a reduce-side join?

Ans:
Joining multiple tables on the mapper side is called a map-side join. Note that a map-side join has strict requirements: the inputs must be formatted, partitioned, and sorted in the same way, and it works best when one of the datasets is small.
Joining the tables on the reducer side is called a reduce-side join. When the tables are large — for example, one with a large number of rows and columns and another with only a few — the data goes through the reduce phase, and the inputs only need to be partitioned properly on the join key. This is the most general way to join multiple tables.