
May 11, 2022

Top 20 Apache Pig Interview Questions and Answers


            Pig is an open-source Apache project that runs on Hadoop and provides a parallel data flow engine. It includes the Pig Latin language, which is used to express data flows. It supports operations such as sorting, joining, and filtering, as well as scripted UDFs (User Defined Functions) for reading, writing, and processing data. Pig stores and processes the entire task using MapReduce and HDFS.


Ques. 1): What benefits does Pig have over MapReduce?


The development cycle for MapReduce is extremely long. It takes a long time to write mappers and reducers, compile and package the code, submit jobs, and retrieve the results. Dataset joins are complex to implement. MapReduce is low-level and rigid, which results in a large amount of specialised user code that is difficult to maintain and reuse.

Pig does not require code to be compiled or packaged. Pig operators are translated into MapReduce jobs internally. Pig Latin supports all common data-processing operations and provides a high-level abstraction for processing large data sets.
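To make the contrast concrete, here is a minimal word-count sketch in Pig Latin; the input file and output path are hypothetical. The equivalent raw MapReduce job would need a mapper class, a reducer class, and driver code.

```pig
-- Load raw lines (file name is illustrative)
lines  = LOAD 'input.txt' AS (line:chararray);
-- Split each line into words and un-nest the resulting bag
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count each group
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```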


Ques. 2): Is Pig Latin a strongly typed language? If so, how did you arrive at your conclusion?


In a strongly typed language, the type of every variable must be declared up front. When you describe the schema of the data in Apache Pig, it expects the data to arrive in that format.

When the schema is unknown, however, the script adapts to the actual data types at runtime. Pig Latin can therefore be described as strongly typed in most circumstances but loosely typed in others, i.e. it continues to work with data that does not match its declared expectations.
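As a rough sketch of both behaviours (file and field names are hypothetical): with a declared schema Pig enforces the stated types, while without one it falls back to untyped bytearrays and resolves types from usage at runtime.

```pig
-- Schema declared: fields are typed up front
A = LOAD 'emp.csv' USING PigStorage(',') AS (name:chararray, salary:int);

-- No schema: fields are bytearrays; $1 is cast to a number
-- only because it is used in arithmetic at runtime
B = LOAD 'emp.csv' USING PigStorage(',');
C = FOREACH B GENERATE $1 + 100;
```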


Ques. 3): What are Pig's disadvantages?


Pig has a number of limitations, including:

Pig isn't the best choice for real-time applications.

When you need to get a single record from a large dataset, Pig isn't very useful.

It works in batches since it uses MapReduce.


Ques. 4): What is PigStorage, exactly?


Pig comes with a default load function called PigStorage. We can also use PigStorage to import data from a file system into Pig.

While loading data with PigStorage, we can specify the delimiter (how the fields in a record are separated). We can also provide the data's schema along with the data's types.
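A minimal sketch of loading and storing with PigStorage, assuming a hypothetical comma-separated file; the delimiter and the schema are both supplied at load time.

```pig
-- Load comma-delimited records with an explicit schema
students = LOAD 'students.csv' USING PigStorage(',')
           AS (id:int, name:chararray, gpa:float);
-- Write the relation back out, pipe-delimited this time
STORE students INTO 'students_out' USING PigStorage('|');
```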


Ques. 5): Explain Grunt in Pig and its characteristics.


Grunt serves as Pig's interactive shell. Grunt's main characteristics are:

To move the cursor to the end of a line, press the ctrl-e key combination.

Because Grunt retains command history, lines in the history buffer can be recalled using the up and down cursor keys.

Grunt supports the auto-completion method by attempting to finish Pig Latin keywords and functions when the Tab key is hit.


Ques. 6): What Does Pig Flatten Mean?


When there is data in a tuple or a bag, we can use the FLATTEN modifier in Pig to remove that level of nesting; FLATTEN un-nests bags and tuples. For tuples, FLATTEN substitutes the fields of the tuple in place of the tuple itself; un-nesting bags is a little more complicated because it necessitates the creation of new tuples, one per tuple in the bag.
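A small sketch of un-nesting a bag with FLATTEN (input and field names are hypothetical); each (name, bag) record becomes one output tuple per element of the bag.

```pig
-- hobbies is a bag of single-field tuples
A = LOAD 'people' AS (name:chararray, hobbies:bag{t:(hobby:chararray)});
-- FLATTEN removes the nesting: one (name, hobby) tuple per bag element
B = FOREACH A GENERATE name, FLATTEN(hobbies);
```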


Ques. 7): Can you distinguish between logical and physical plans?


Pig goes through several steps when converting a Pig Latin script into MapReduce jobs. After basic parsing and semantic checking, Pig generates a logical plan, which describes the logical operators of the script. Pig then generates a physical plan, which specifies the physical operators required to execute the script.


Ques. 8): In Pig, what does a co-group do?


COGROUP groups two data sets jointly by their common field and produces a set of records containing two separate bags. The first bag holds the records of the first data set that share the common key, and the second bag holds the records of the second data set with the same key.
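A brief COGROUP sketch with two hypothetical inputs; each output record carries the key plus one bag per input.

```pig
owners = LOAD 'owners.csv' USING PigStorage(',')
         AS (owner:chararray, pet:chararray);
pets   = LOAD 'pets.csv' USING PigStorage(',')
         AS (pet:chararray, legs:int);
-- One record per key: {owners matching key} and {pets matching key}
grouped = COGROUP owners BY pet, pets BY pet;
```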


Ques. 9): Explain the bag.


Pig includes several data models, one of which is the bag. The bag is an unordered collection of tuples, possibly with duplicates, that is used to hold collections while they are being grouped. A bag's size is bounded only by the local disk, meaning it is limited: when a bag grows too large, Pig spills it onto the local disk and keeps only part of it in memory, so the entire bag does not need to fit into memory. Bags are denoted with curly braces {}.
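For instance (input and field names hypothetical), grouping produces a bag-valued field, and a bag literal is written with curly braces:

```pig
-- A bag literal looks like: {(Alice,25),(Bob,31),(Alice,25)}
A = LOAD 'people' AS (name:chararray, age:int);
-- After GROUP, the second field of G is a bag holding A's tuples
G = GROUP A BY name;
```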


Ques. 10): Can you describe the similarities and differences between Pig and Hive?


Both Hive and Pig have similar characteristics.

Both internally transform the commands to MapReduce.

High-level abstractions are provided by both technologies.

Low-latency queries are not supported by either.

OLAP and OLTP are not supported by either.


Ques. 11): How do Apache Pig and SQL compare?


Apache Pig is set apart from SQL by its use for ETL, its lazy evaluation, its ability to store data at any point in the pipeline, its support for pipeline splits, and its explicit specification of execution plans. SQL is built around queries that return a single result. SQL has no built-in mechanism for splitting a data-processing stream into sub-streams and applying different operators to each one.

User code can be added at any step in the pipeline with Apache Pig, whereas with SQL, data must first be put into the database before the cleaning and transformation process can begin.
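One way to see the difference is Pig's SPLIT operator, which fans a single stream out into several sub-streams, something a single SQL query cannot express directly (names below are hypothetical):

```pig
logs = LOAD 'access_log' AS (level:chararray, msg:chararray);
-- Fan one pipeline out into two sub-streams
SPLIT logs INTO errors IF level == 'ERROR',
                rest   IF level != 'ERROR';
STORE errors INTO 'errors_out';
STORE rest   INTO 'rest_out';
```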


Ques. 12): Can Apache Pig Scripts Join Multiple Fields?


Yes, several fields can be joined in Pig scripts, since join operations take records from one input and combine them with records from another. This is accomplished by specifying the keys for each input and joining two records when the keys are equal.
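A sketch of a two-field join, with hypothetical relations; both key fields must match for two records to be joined.

```pig
orders   = LOAD 'orders'   AS (cust_id:int, yr:int, item:chararray);
payments = LOAD 'payments' AS (cust_id:int, yr:int, amt:double);
-- Composite join key: (cust_id, yr)
J = JOIN orders BY (cust_id, yr), payments BY (cust_id, yr);
```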


Ques. 13): What is the difference between the commands store and dumps?


After running the dump command, the data appears on the console but is not saved. The store command, in contrast, writes its output to a folder in the local file system or HDFS. In production environments, most Hadoop developers use the 'store' command to persist data in HDFS.
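A minimal illustration (paths hypothetical):

```pig
A = LOAD 'input' AS (f1:int, f2:chararray);
DUMP A;                  -- prints the relation to the console only
STORE A INTO 'out_dir';  -- writes part files to HDFS or the local FS
```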


Ques. 14):  Is 'FUNCTIONAL' a User Defined Function (UDF)?


No, the keyword 'FUNCTIONAL' is not a User Defined Function (UDF). Writing a UDF requires overriding certain functions, and the task must be completed using those functions. The keyword 'FUNCTIONAL', however, is a built-in (pre-defined) function, so it cannot serve as a UDF.


Ques. 15): Which method must be overridden when writing evaluate UDF?


When writing an eval UDF in Pig, we must override the exec() method. The base class differs by UDF type: when writing a filter UDF we extend FilterFunc, and when writing an eval UDF we extend EvalFunc. EvalFunc is parameterised, and the return type must be specified as well.
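On the Pig Latin side, a UDF whose class overrides exec() is registered and invoked like this; the jar name and the myudfs.UPPER class are hypothetical examples.

```pig
-- Make the (hypothetical) UDF jar visible to the script
REGISTER myudfs.jar;
A = LOAD 'emp' AS (name:chararray);
-- myudfs.UPPER would extend EvalFunc<String> and override exec()
B = FOREACH A GENERATE myudfs.UPPER(name);
```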


Ques. 16): What role does MapReduce play in Pig programming?


Pig is a high-level framework that simplifies the execution of various Hadoop data analysis problems. A Pig Latin program is similar to a SQL query and is run by an execution engine. The Pig engine converts programs into MapReduce jobs, with MapReduce serving as the execution engine.


Ques. 17): What Debugging Tools Are Available For Apache Pig Scripts?


The essential debugging utilities in Apache Pig are describe and explain.

When trying to troubleshoot or optimise Pig Latin scripts, Hadoop developers will find the explain utility useful. In the Grunt interactive shell, explain can be applied to a specific alias in the script or to the entire script. The explain utility generates multiple text-based graphs that can be printed to a file.

The describe debugging utility is useful when writing Pig scripts, since it displays the schema of a relation in the script. Beginners learning Apache Pig can use the describe utility to see how each operator alters data. A Pig script can contain multiple describe statements.
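Both utilities are invoked on an alias, in a script or in Grunt, e.g. (relation names hypothetical):

```pig
A = LOAD 'emp' AS (name:chararray, salary:int);
B = FILTER A BY salary > 1000;
DESCRIBE B;   -- prints B's schema
EXPLAIN B;    -- prints the logical, physical and MapReduce plans for B
```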


Ques. 18): What are the relation operations in Pig? Explain any two with examples.


The relational operators in Pig include: foreach, order by, filter, group, distinct, join, and limit.

foreach: takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.

A = LOAD 'input' AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray, preferences:map[]);
B = FOREACH A GENERATE emp_name, emp_id;

filter: contains a predicate and allows us to select which records are retained in the data pipeline.

Syntax: alias = FILTER alias BY expression;

Alias is the name of the relation, BY is a required keyword, and the expression is Boolean.

Example: M = FILTER N BY F5 == 50;


Ques. 19): What are some Apache Pig use cases that come to mind?


Apache Pig is a big data tool used for iterative processing, raw data exploration, and standard ETL data pipelines. Pig is commonly used by researchers who want to work with data before it is cleansed and loaded into the data warehouse, because it can operate where the schema is unknown, inconsistent, or incomplete.

It can be used by a website to track the response of users to various sorts of adverts, photos, articles, and so on in order to construct behaviour prediction models.


Ques. 20): In Apache Pig, what is the purpose of illustrating?


Running Pig scripts on large datasets can take a long time, so developers typically test them on sample data first, even though it is probable that a randomly chosen sample will not exercise the script properly. If the script includes a join operator, for example, the sample data must contain at least a few records that share the same key, or the join will produce nothing. Developers manage these issues with illustrate: whenever it encounters operators such as filter or join that remove data, it takes records from the sample and, where necessary, modifies them so that some records pass the condition while others are rejected. Illustrate displays each step's output but does not run MapReduce jobs.
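A short sketch of its use (input and condition hypothetical):

```pig
visits = LOAD 'visits' AS (user:chararray, url:chararray);
ads    = FILTER visits BY url MATCHES '.*ads.*';
-- Shows sample output for every step, adjusting sample records so
-- the FILTER passes some and rejects others; no MapReduce jobs run
ILLUSTRATE ads;
```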




November 17, 2021

Top 20 Apache Ambari Interview Questions & Answers


Ques: 1). Describe Apache Ambari's main characteristics.


Apache Ambari is an Apache product created with the goal of making Hadoop applications easier to manage. Ambari assists in the management of Hadoop clusters. Its main characteristics are:

  • Provisioning is simple.
  • Project management made simple
  • Monitoring of Hadoop clusters
  • Availability of a user-friendly interface
  • Hadoop management web UI
  • RESTful API support


Ques: 2). Why do you believe Apache Ambari has a bright future?


With the growing need for big data technologies like Hadoop, we've witnessed a surge in data analysis, resulting in gigantic clusters. Companies are turning to technologies like Apache Ambari for better cluster management, increased operational efficiency, and increased visibility. Furthermore, we've noted how HortonWorks, a technology titan, is working on Ambari to make it more scalable. As a result, learning Hadoop as well as technologies like Apache Ambari is advantageous.


Ques: 3). What are the core benefits for Hadoop users by using Apache Ambari?


Apache Ambari is a great asset for individuals who use Hadoop in their day-to-day work. With Ambari, Hadoop users get the following core benefits:

1. The installation process is simplified
2. Configuration and overall management is simplified
3. It has a centralized security setup process
4. It gives out full visibility in terms of Cluster health
5. It is highly extensible, with the option to customize if needed.


Ques: 4). What Are The Checks That Should Be Done Before Deploying A Hadoop Instance?


Before actually deploying the Hadoop instance, the following checklist should be completed:

  • Check for existing installations
  • Set up passwordless SSH
  • Enable NTP on the clusters
  • Check for DNS
  • Disable the SELinux
  • Disable iptables


Ques: 5). As a Hadoop user or system administrator, why should you choose Apache Ambari?


Using Apache Ambari provides a Hadoop user with a number of advantages. A system administrator can use Ambari to:

  • Install Hadoop across any number of hosts, following the step-by-step guide supplied by Ambari, while Ambari handles the Hadoop installation configuration.
  • Centrally administer Hadoop services across the cluster.
  • Efficiently monitor the state and health of a Hadoop cluster using the Ambari metrics system. Furthermore, the Ambari alert framework sends out timely notifications for system difficulties, such as disk space issues or node status.


Ques: 6). Can you explain Apache Ambari architecture?


Apache Ambari consists of following major components-

  • Ambari Server
  • Ambari Agent
  • Ambari Web

Apache Ambari Architecture

All metadata is handled by the Ambari server, which includes a Postgres database instance, as indicated in the diagram. The Ambari agent is installed on each computer in the cluster, and the Ambari server manages each host through it.

An Ambari agent runs on each host and delivers heartbeats from the node to the Ambari server, along with numerous operational metrics, to determine the node's health.

The Ambari Web UI is a client-side JavaScript application that performs cluster operations by regularly accessing the Ambari RESTful API. The RESTful API also facilitates asynchronous communication between the application and the server.


Ques: 7). Apache Ambari supports how many layers of Hadoop components, and what are they?


Apache Ambari supports three tiers of Hadoop components, which are as follows:

1. Hadoop core components

  • Hadoop Distributed File System (HDFS)
  • MapReduce

2. Essential Hadoop components

  • Apache Pig
  • Apache Hive
  • Apache HCatalog
  • WebHCat
  • Apache HBase
  • Apache ZooKeeper

3. Components of Hadoop support

  • Apache Oozie
  • Apache Sqoop
  • Ganglia
  • Nagios


Ques: 8). What different sorts of Ambari repositories are there?


Ambari Repositories are divided into four categories, as below:

  1. Ambari: Ambari server, monitoring software packages, and Ambari agent are all stored in this repository.
  2. HDP-UTILS: The Ambari and HDP utility packages are stored in this repository.
  3. HDP: Hadoop Stack packages are stored in this repository.
  4. EPEL (Extra Packages for Enterprise Linux): an additional set of software packages for the Enterprise Linux repository.


Ques: 9). How can I manually set up a local repository?


When there is no active internet connection available, this technique is used. Please follow the instructions below to create a local repository:

1. First and foremost, create an Apache httpd host.
2. Download a tarball copy of each repository's entire contents.
3. Once downloaded, extract the contents.


Ques: 10). What is a local repository, and when are you going to utilise one?


A local repository is a hosted place for Ambari software packages in the local environment. When the enterprise clusters have no or limited outbound Internet access, this is the method of choice.


Ques: 11). What are the benefits of setting up a local repository?


First and foremost, by setting up a local repository you can access Ambari software packages without internet access. Along with that, you gain benefits such as:

Enhanced governance and better installation performance

Faster routine post-installation cluster operations, such as service start and restart


Ques: 12). What are the new additions in Ambari 2.6 versions?


Ambari 2.6.2 added the following features:

  • It protects Zeppelin Notebook SSL credentials
  • We can set appropriate HTTP headers to use Cloud Object Stores with HDP

Ambari 2.6.1 added the following feature:

  • Conditional installation of LZO packages through Ambari

Ambari 2.6.0 added the following features:

  • Distributed mode of Ambari Metrics System (AMS) with multiple Collectors
  • Host Recovery improvements for restarts
  • Moving masters with minimum impact and scale testing
  • Improvements in Data Archival & Purging in Ambari Infra


Ques: 13). List Out The Commands That Are Used To Start, Check The Progress And Stop The Ambari Server?


The following are the commands that are used to do the following activities:

To start the Ambari server

ambari-server start

To check the Ambari server processes

ps -ef | grep Ambari

To stop the Ambari server

ambari-server stop


Ques: 14). What all tasks you can perform for managing host using Ambari host tab?


Using Hosts tab, we can perform the following tasks:

  • Analysing Host Status
  • Searching the Hosts Page
  • Performing Host related Actions
  • Managing Host Components
  • Decommissioning a Master node or Slave node
  • Deleting a Component
  • Setting up Maintenance Mode
  • Adding or removing Hosts to a Cluster
  • Establishing Rack Awareness


Ques: 15). What all tasks you can perform for managing services using Ambari service tab?


Using Services tab, we can perform the following tasks:

  • Start and Stop of All Services
  • Display of Service Operating Summary
  • Adding a Service
  • Configuration Settings change
  • Performing Service Actions
  • Rolling Restarts
  • Background Operations monitoring
  • Service removal
  • Auditing operations
  • Using Quick Links
  • YARN Capacity Scheduler refresh
  • HDFS management
  • Atlas management in a Storm Environment


Ques: 16). Is there a relationship between the amount of free RAM and disc space required and the number of HDP cluster nodes?


Without a doubt, it does. The amount of RAM and disk required depends on the number of nodes in your cluster. Typically, 1 GB of memory and 10 GB of disk space are required for each node. Similarly, a 100-node cluster requires 4 GB of memory and 100 GB of disk space. To get all of the details, consult the documentation for your specific version.



Ques: 18). What is the best method for installing the Ambari agent on all 1000 hosts in the HDP cluster?


Because the cluster contains 1000 nodes, we should not manually install the Ambari agent on each node. Instead, we should set up a password-less ssh connection between the Ambari host and all of the cluster's nodes. To remotely access and install the Ambari Agent, Ambari Server hosts employ SSH public key authentication.


Ques: 19). What can I do if I have admin capabilities in Ambari?


Becoming a Hadoop Administrator is a demanding job. If you are an Ambari Admin, you can create a cluster, manage the users in that cluster, and create groups. All of these permissions are granted to the default admin user. Even as an Ambari administrator, you can grant the same or different permissions to another user.


Ques: 20).  How is recovery achieved in Ambari?


Recovery happens in Ambari in the following ways:

Based on actions:

After a restart, the master checks for pending actions and reschedules them, since every action is persisted. The master also rebuilds the state machines on restart, because the cluster state is persisted in the database. There is a race condition in which actions may complete while the master fails before recording their completion; because actions should be idempotent, the master simply restarts any action that is not marked as completed or failed in the database. These persisted actions can be seen in the redo logs.

Based on the desired state:

The master attempts to bring the cluster to the desired state, as the desired state of the cluster is persisted in the database.