Ques: 1. Why is Hadoop needed?
Ans: A large amount of unstructured data is dumped into our machines every single day. The major challenge is not storing large data sets but retrieving and analyzing them, especially when the data sits on different machines at different locations.
This is where Hadoop becomes necessary. Hadoop can analyze data spread across different machines and locations quickly and cost-effectively. It uses MapReduce, which divides a query into small parts and processes them in parallel. This is also known as parallel computing.
Ques: 2. What is the Hadoop Framework?
Ans: It is an open-source framework written in Java and developed by the Apache Software Foundation. It is used to write software applications that need to process vast amounts of data. Hadoop can handle multiple terabytes of data. It works in parallel on large clusters that can have thousands of computers (nodes), and it processes data in a reliable and fault-tolerant manner.
Ques: 3. What are the various basic characteristics of Hadoop?
Ans: The Hadoop framework is capable of solving problems that involve Big Data analysis. It is written in Java. Its programming model is based on Google's MapReduce, and its infrastructure is based on Google's Big Data and distributed file system designs. Hadoop is scalable: more nodes can be added to a cluster.
Ques: 4. What are the core components of Hadoop?
Ans: HDFS and MapReduce are the core components of Hadoop. Hadoop Distributed File System (HDFS) is basically used to store large data sets and MapReduce is used to process such large data sets.
Ques: 5. What do you understand by streaming access?
Ans: HDFS works on the principle of 'Write Once, Read Many'. Streaming access is therefore extremely important in HDFS. Reading the complete data set matters more than the time taken to fetch a single record from it. HDFS focuses less on how the data is stored and more on how to retrieve it at the fastest possible speed, especially while analyzing logs.
Ques: 6. What do you understand by a task tracker?
Ans: Task trackers manage the execution of individual tasks on the slave nodes. The task tracker is a daemon that runs on the same machines as the DataNodes. When a client submits a job, the job tracker initializes the job, divides the work, and assigns it to different task trackers, which perform the MapReduce tasks.
While performing its tasks, each task tracker keeps communicating with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within a specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.
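Note: the heartbeat timeout described above can be sketched in a few lines of Python. This is only a toy model, not Hadoop code. The class name JobTrackerSketch, the timeout_s parameter and the method names are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class JobTrackerSketch:
    """Toy model of heartbeat-based failure detection (hypothetical, not Hadoop code)."""
    timeout_s: float                                  # assumed expiry window in seconds
    last_seen: dict = field(default_factory=dict)     # tracker name -> time of last heartbeat
    assignments: dict = field(default_factory=dict)   # task id -> tracker name

    def heartbeat(self, tracker: str, now: float) -> None:
        # Record that the tracker was alive at time `now`.
        self.last_seen[tracker] = now

    def assign(self, task: str, tracker: str) -> None:
        self.assignments[task] = tracker

    def reassign_from_dead(self, now: float) -> list:
        # A tracker is live if its last heartbeat is within the timeout window.
        live = sorted(t for t, seen in self.last_seen.items()
                      if now - seen <= self.timeout_s)
        moved = []
        for task, tracker in sorted(self.assignments.items()):
            if tracker not in live and live:
                # Hand the task to the first live tracker (simplest possible policy).
                self.assignments[task] = live[0]
                moved.append(task)
        return moved


jt = JobTrackerSketch(timeout_s=10.0)
jt.heartbeat("tt1", now=0.0)
jt.heartbeat("tt2", now=0.0)
jt.assign("map-0", "tt1")
jt.assign("map-1", "tt2")
jt.heartbeat("tt2", now=12.0)          # tt1 stays silent
print(jt.reassign_from_dead(now=15.0))  # ['map-0']
print(jt.assignments)
```

Here tt1 has been silent for 15 seconds, longer than the assumed 10-second window, so its task moves to tt2.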
Ques: 7. If a particular file is 40 MB, will the HDFS block still consume 64 MB as the default size?
Ans: No. 64 MB is only the maximum size of a block, the unit in which data is stored. In this situation the block consumes only 40 MB of disk space, and the remaining 24 MB stays free to store something else. The master node (NameNode) handles block allocation efficiently.
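Note: the arithmetic behind this answer can be checked with a short Python sketch. The function name split_into_blocks is made up for illustration, and the 64 MB default simply mirrors the figure used in the question.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 64 * MB) -> list:
    """Return the on-disk size in bytes of each block needed for a file."""
    full_blocks, remainder = divmod(file_size, block_size)
    # Full blocks use block_size bytes; a final partial block uses only what it holds.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


print([b // MB for b in split_into_blocks(40 * MB)])   # [40]
print([b // MB for b in split_into_blocks(150 * MB)])  # [64, 64, 22]
```

A 40 MB file needs one block that holds 40 MB, and a 150 MB file needs two full blocks plus a 22 MB block.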
Ques: 8. What is a Rack in Hadoop?
Ans: A rack is a physical collection of DataNodes that are kept together at a single location. There can be multiple racks in a single location, and different racks can also be physically located at different places.
Ques: 9. How will the data be stored on a Rack?
Ans: When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the NameNode and gets 3 DataNodes for every block of the file, which indicate where the block should be stored. While placing the replicas, the key rule followed is "for every block of data, two copies will exist in one rack, third copy in a different rack". This rule is known as the "Replica Placement Policy".
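Note: below is a minimal Python sketch of one placement that satisfies the quoted rule. It assumes the first copy goes to the DataNode that is writing the block and the other two go to two nodes of one other rack. The function name place_replicas and the topology dictionary are hypothetical, and the choice is deterministic only to keep the example easy to follow.

```python
def place_replicas(topology: dict, writer: str) -> list:
    """Pick 3 DataNodes for one block: 1 on the writer's rack, 2 on another rack.

    topology maps rack name -> list of DataNode names (illustrative structure).
    Raises StopIteration if no other rack has at least two nodes.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # Choose the first other rack (in name order) that can hold two replicas.
    remote_rack = next(rack for rack in sorted(topology)
                       if rack != local_rack and len(topology[rack]) >= 2)
    second, third = sorted(topology[remote_rack])[:2]
    return [writer, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(topology, writer="dn1"))  # ['dn1', 'dn3', 'dn4']
```

Two copies end up on rack2 and one on rack1, so losing a whole rack still leaves at least one copy of the block.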
Ques: 10. Can you explain the input and output data format of the Hadoop Framework?
Ans: The MapReduce framework operates exclusively on <key, value> pairs. That is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The flow can be like: [input] <k1, v1> -> map -> <k2, v2> -> combine/sorting -> <k2, v2> -> reduce -> <k3, v3> [output]
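Note: the flow of key/value pairs can be illustrated with a tiny in-memory word count in Python. This is only a sketch of the idea, not the Hadoop API. The names map_fn, reduce_fn and run_job are invented for the example, and everything runs in one process.

```python
from itertools import groupby


def map_fn(_offset, line):
    # Input pair: (byte offset, line of text). Output pairs: (word, 1).
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    # Input: one key with all of its values. Output pair: (word, total).
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    # map: every input pair produces zero or more intermediate pairs
    intermediate = [pair for key, value in records for pair in mapper(key, value)]
    # sort: bring pairs with the same key next to each other
    intermediate.sort(key=lambda pair: pair[0])
    # reduce: call the reducer once per distinct key
    output = []
    for key, group in groupby(intermediate, key=lambda pair: pair[0]):
        output.extend(reducer(key, (value for _, value in group)))
    return output


records = [(0, "big data big"), (13, "data")]
print(run_job(records, map_fn, reduce_fn))  # [('big', 2), ('data', 2)]
```

The input pairs, intermediate pairs and output pairs all have different types here, which matches the "conceivably of different types" remark.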
Ques: 11. How can you use the Reducer?
Ans: The Reducer reduces a set of intermediate values that share a key to a (usually smaller) set of values. The number of reduce tasks for the job is set by the user via Job.setNumReduceTasks(int).
Ques: 12. How can you explain the core methods of the Reducer?
Ans: The API of Reducer is very similar to that of Mapper. There is a run() method that receives a Context containing the job's configuration as well as interfacing methods that return data from the reducer back to the framework. The run() method calls setup() once, reduce() once for each key associated with the reduce task, and cleanup() once at the end. Each of these methods can access the job's configuration data by using Context.getConfiguration().
The reduce() method is the heart of any Reducer. It is called once per key; the second argument is an Iterable that returns all the values associated with that key.
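Note: the calling order of these methods can be mimicked in plain Python. The classes below are a hypothetical stand-in written for illustration; they are not the Hadoop Reducer or Context classes, and the "scale" configuration key is made up.

```python
class Context:
    """Hypothetical stand-in for the framework context (not the Hadoop class)."""

    def __init__(self, grouped, configuration=None):
        self._grouped = grouped                  # list of (key, list of values)
        self._configuration = configuration or {}
        self.output = []

    def get_configuration(self):
        return self._configuration

    def grouped_input(self):
        return iter(self._grouped)

    def write(self, key, value):
        self.output.append((key, value))


class ReducerSketch:
    def setup(self, context):
        pass

    def reduce(self, key, values, context):
        for value in values:
            context.write(key, value)

    def cleanup(self, context):
        pass

    def run(self, context):
        self.setup(context)                       # once, before any key
        for key, values in context.grouped_input():
            self.reduce(key, values, context)     # once per key
        self.cleanup(context)                     # once, at the end


class SumReducer(ReducerSketch):
    def setup(self, context):
        self.calls = ["setup"]
        self.scale = context.get_configuration().get("scale", 1)

    def reduce(self, key, values, context):
        self.calls.append("reduce:" + key)
        context.write(key, self.scale * sum(values))

    def cleanup(self, context):
        self.calls.append("cleanup")


ctx = Context([("a", [1, 2]), ("b", [3])], {"scale": 10})
reducer = SumReducer()
reducer.run(ctx)
print(reducer.calls)  # ['setup', 'reduce:a', 'reduce:b', 'cleanup']
print(ctx.output)     # [('a', 30), ('b', 30)]
```

The recorded call list shows setup() first, reduce() once for each of the two keys, and cleanup() last.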
Ques: 13. How does the JobTracker schedule a task?
Ans: The TaskTrackers send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker stays up to date on where in the cluster work can be delegated. When the JobTracker looks for somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data. If there is none, it looks for an empty slot on a machine in the same rack.
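Note: the locality preference described above can be sketched as follows in Python. The function pick_tracker and its arguments are hypothetical. The last fallback (any free slot in the cluster) is an assumption added so that the function always gives an answer; the text above only describes the first two steps.

```python
def pick_tracker(block_hosts, free_slots, rack_of):
    """Choose a host for a task, preferring hosts close to the data.

    block_hosts: hosts whose DataNode stores the input block
    free_slots:  host -> number of free task slots (as reported by heartbeats)
    rack_of:     host -> rack name
    """
    # 1. Same server as the data.
    for host in sorted(block_hosts):
        if free_slots.get(host, 0) > 0:
            return host, "data-local"
    # 2. Another machine in the same rack as the data.
    data_racks = {rack_of[host] for host in block_hosts}
    for host in sorted(free_slots):
        if free_slots[host] > 0 and rack_of[host] in data_racks:
            return host, "rack-local"
    # 3. Assumption: otherwise take any free slot in the cluster.
    for host in sorted(free_slots):
        if free_slots[host] > 0:
            return host, "off-rack"
    return None, "none"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(pick_tracker(["n1"], {"n1": 1, "n2": 1, "n3": 2}, rack_of))  # ('n1', 'data-local')
print(pick_tracker(["n1"], {"n1": 0, "n2": 1, "n3": 2}, rack_of))  # ('n2', 'rack-local')
```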
Ques: 14. How many daemon processes run on a Hadoop cluster?
Ans: Five daemons run on a Hadoop cluster. Each of these daemons runs in its own JVM.
The NameNode, Secondary NameNode and JobTracker daemons run on master nodes. The DataNode and TaskTracker daemons run on each slave node.
- NameNode: Stores and maintains the metadata for HDFS.
- Secondary NameNode: Performs housekeeping functions for the NameNode.
- JobTracker: Manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker.
- DataNode: Stores the actual HDFS data blocks.
- TaskTracker: Responsible for instantiating and monitoring individual map and reduce tasks.
Ques: 15. What is Hadoop Distributed File System (HDFS)? How is it different from traditional file systems?
Ans: The Hadoop Distributed File System (HDFS) is responsible for storing huge amounts of data on the cluster. It is a distributed file system designed to run on commodity hardware.
It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
- HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
- HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.
Ques: 16. What are the IdentityMapper and IdentityReducer in MapReduce?
Ans:
- org.apache.hadoop.mapred.lib.IdentityMapper: Implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.
- org.apache.hadoop.mapred.lib.IdentityReducer: Performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.
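Note: the effect of these defaults can be imitated in Python. The functions below are a sketch written for this answer, not the Hadoop classes, and run_identity_job is a made-up in-memory driver. With an identity mapper and an identity reducer, the only visible change to the data is that the framework groups it by key.

```python
from itertools import groupby


def identity_mapper(key, value):
    # Emit every input pair unchanged.
    yield key, value


def identity_reducer(key, values):
    # Emit every value for the key unchanged (no reduction).
    for value in values:
        yield key, value


def run_identity_job(records, mapper=identity_mapper, reducer=identity_reducer):
    # The defaults play the role of the classes used when none is set.
    intermediate = [pair for key, value in records for pair in mapper(key, value)]
    intermediate.sort(key=lambda pair: pair[0])   # stable sort by key
    output = []
    for key, group in groupby(intermediate, key=lambda pair: pair[0]):
        output.extend(reducer(key, (value for _, value in group)))
    return output


print(run_identity_job([("b", 2), ("a", 1), ("b", 1)]))
# [('a', 1), ('b', 2), ('b', 1)]
```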
Ques: 17. What do you mean by commodity hardware? How can Hadoop work on it?
Ans: Average, inexpensive systems are known as commodity hardware, and Hadoop can be installed on any of them. Hadoop does not require high-end hardware to function.
Ques: 18. Which one is the Master Node in HDFS? Can it be commodity?
Ans: The NameNode is the master node in HDFS, and the JobTracker commonly runs on the same machine. The NameNode holds the metadata and is a single point of failure in HDFS, so it should be a high-availability machine. It cannot be commodity hardware, because the entire HDFS depends on it.
Ques: 19. What is the main difference between Mapper and Reducer?
Ans:
- The map method is called separately for each input key/value pair. It processes input key/value pairs and emits intermediate key/value pairs.
- The reduce method is called separately for each key/values-list pair. It processes intermediate key/value pairs and emits final key/value pairs.
- Both are initialized before any of their other methods is called. This initialization step takes no input records and produces no output.
Ques: 20. What is the difference between a map-side join and a reduce-side join?
Ans:
Map-side join: Joining multiple tables on the mapper side is called a map-side join. Note that a map-side join requires the input in a strict format: it must be sorted and partitioned properly. It suits smaller data sets and does not need to go through the reducer phase.
Reduce-side join: Joining multiple tables on the reducer side is called a reduce-side join. Use it when you plan to join large tables, for example when one table has a large number of rows and columns and the other has only a few. It is the most general way to join multiple tables, because the input needs no special sorting or partitioning, but all the data must pass through the shuffle.
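Note: the two join styles can be contrasted with a small Python sketch. Both functions are illustrative and run in memory; they are not Hadoop code. The map-side version shown here is the simpler in-memory lookup variant, which assumes the small table fits in every mapper's memory, rather than the sorted and partitioned variant. The reduce-side version tags each record with its source table, groups by key, and joins inside the reduce step.

```python
from collections import defaultdict


def map_side_join(big_rows, small_table):
    # small_table is a dict held in memory by every mapper; no reduce step is needed.
    return [(key, value, small_table[key])
            for key, value in big_rows if key in small_table]


def reduce_side_join(left_rows, right_rows):
    # map: tag every record with the table it came from
    tagged = [(key, ("L", value)) for key, value in left_rows]
    tagged += [(key, ("R", value)) for key, value in right_rows]
    # shuffle: group the tagged records by join key
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    # reduce: pair every left record with every right record of the same key
    output = []
    for key in sorted(groups):
        lefts = [value for tag, value in groups[key] if tag == "L"]
        rights = [value for tag, value in groups[key] if tag == "R"]
        output.extend((key, left, right) for left in lefts for right in rights)
    return output


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "cup")]
users = {1: "ann", 2: "bob"}
print(map_side_join(orders, users))
print(reduce_side_join(orders, list(users.items())))
```

Both calls produce the same joined rows, only in a different order; the order with key 3 has no matching user and is dropped by both.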