Showing posts with label hive. Show all posts

November 17, 2021

Top 20 Apache Hive Interview Questions & Answers


Ques: 1). What is Apache Hive, and how does it work?

Answer:

Apache Hive is a data warehouse project built on top of Hadoop. It focuses on data analysis and provides a SQL-like query language, HiveQL, for querying data stored in files and in the storage systems that integrate with Hadoop. Hive is a popular data analysis and querying technology, used by Fortune 500 companies around the world. When logic is cumbersome or inefficient to express in HiveQL, Hive lets you plug in custom mappers and reducers as well as User Defined Functions (UDFs).

 



Ques: 2). What is the purpose of Hive?

Answer:

Hive is a Hadoop tool that allows you to organise and query data in a database-like format, as well as write SQL-like queries. It can be used to access and analyse Hadoop data using SQL syntax.


Ques: 3). What are the differences between local and remote metastores?

Answer:

Local metastore: In the local metastore configuration, the metastore service runs in the same Java Virtual Machine (JVM) as the Hive service and connects to a database running in a separate process, either on the same machine or on a remote machine.

Remote metastore: In the remote metastore configuration, the metastore service runs in its own JVM, separate from the Hive service. Other processes connect to the metastore server through the Thrift network API. You can run several metastore servers in this configuration for high availability.


Ques: 4). Explain the core difference between the external and managed tables?

Answer:

The following are the fundamental distinctions between managed and external tables:

When a managed table is dropped, both its metadata and its data are deleted. When an external table is dropped, Hive deletes only the metadata associated with the table and leaves the table data in HDFS.

By default, when you create a table, Hive manages the data, which means it moves the data into its warehouse directory. Alternatively, you can create an external table, which instructs Hive to refer to data stored somewhere other than the warehouse directory.

The semantics of LOAD and DROP show the difference between the two table types. Data loaded into a managed table is moved into Hive's warehouse directory, and dropping the table deletes that data along with the metadata. For an external table, Hive does not move the data into its warehouse directory, and dropping the table removes only the metadata.


Ques: 5). What is the difference between schema on read and schema on write?

Answer:

A table's schema is enforced at data load time in a conventional database. The data being loaded is rejected if it does not conform to the schema. Because the data is validated against the schema when it is written into the database, this architecture is frequently referred to as schema on write.

Hive, on the other hand, does not verify data when it is loaded, but rather when it is queried. This is referred to as schema on read.

Between the two approaches, there are trade-offs. Because the data does not have to be read, parsed, and serialized to disc in the database's internal format, schema on read allows for a very quick first load. A file copy or move is all that is required for the load procedure. It's also more adaptable: think of having two schemas for the same underlying data, depending on the analysis. (External tables can be used in Hive for this; see Managed Tables and External Tables.)

Because the database can index columns and compress the data, schema on write makes query time performance faster. However, it takes longer to load data into the database as a result of this trade-off. Furthermore, in many cases, the schema is unknown at load time, thus no indexes can be applied because the queries have not yet been formed. Hive really shines in these situations.
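To make the schema-on-read idea concrete, here is a minimal Python sketch. It is not Hive code: the raw text, the read_with_schema helper, and both schemas are made up for illustration. The raw data is stored untouched, and a schema is applied only when the data is read. A field that does not fit the schema becomes None, mirroring how Hive returns NULL for values it cannot parse.

```python
import csv
import io

# Raw file contents as they would sit in storage; "loading" is just a copy.
RAW = "1,alice,2019\n2,bob,not_a_year\n"

def read_with_schema(raw, schema):
    """Apply (column name, cast function) pairs to every row at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # The value does not fit the schema: expose it as a NULL.
                row[name] = None
        rows.append(row)
    return rows

# Two different schemas over the same underlying data.
typed_schema = [("id", int), ("name", str), ("year", int)]
text_schema = [("id", str), ("name", str), ("year", str)]

typed = read_with_schema(RAW, typed_schema)
as_text = read_with_schema(RAW, text_schema)
```

The same bytes yield a numeric year under one schema and a plain string under the other, and the malformed value only surfaces when the typed schema is used at query time.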


Ques: 6). Write a query to insert a new column? Can you add a column with a default value in Hive?

Answer:

ALTER TABLE test1 ADD COLUMNS (access_count1 int);

You cannot add a column with a default value in Hive. Adding the column has no effect on the files that back your table. To deal with the "missing" data, Hive returns NULL for every cell in that column.

To give the new column real values for existing rows, you must effectively recreate the entire table, this time with the column filled in. It may be easier to rerun the original query that produced the table with the additional column included. Alternatively, you can add the column to the table you already have, then select all of its existing columns plus the new column's value and write the result back.


Ques: 7). What is the purpose of Hive's DISTRIBUTE BY clause?

Answer:

DISTRIBUTE BY determines how map output is split between reducers. By default, MapReduce computes a hash on the keys output by mappers and uses the hash values to try to distribute the key-value pairs evenly among the available reducers. Suppose we want all of the rows for each value of a column to be processed together. To ensure that the records for each value go to the same reducer, we can use DISTRIBUTE BY on that column. DISTRIBUTE BY controls how reducers receive rows for processing in much the same way that GROUP BY does.

If the DISTRIBUTE BY and SORT BY clauses are in the same query, Hive expects the DISTRIBUTE BY clause to come before the SORT BY clause. When you have a memory-intensive job, DISTRIBUTE BY is a helpful workaround because it forces Hadoop to use reducers instead of running a map-only job. Essentially, mappers group data by the DISTRIBUTE BY columns supplied, reducing the framework's overall workload, and then send these groups to the reducers.
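The routing described above can be illustrated with a small Python simulation. This is only a sketch: the function names and sample rows are invented, and zlib.crc32 stands in for whatever hash function the real framework uses. The point is that rows with the same key always land on the same reducer, and that a following SORT BY orders rows only within each reducer.

```python
import zlib
from collections import defaultdict

def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer chosen by hashing the distribution key."""
    buckets = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in hash; it is deterministic across runs.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        buckets[h % num_reducers].append(row)
    return dict(buckets)

def sort_by(buckets, key):
    """Sort rows inside each reducer independently (no global order)."""
    return {r: sorted(rs, key=lambda row: row[key]) for r, rs in buckets.items()}

rows = [
    {"user": "a", "ts": 3},
    {"user": "b", "ts": 1},
    {"user": "a", "ts": 1},
    {"user": "c", "ts": 2},
    {"user": "b", "ts": 0},
]

buckets = distribute_by(rows, "user", num_reducers=2)
sorted_buckets = sort_by(buckets, "ts")
```

In HiveQL terms this corresponds to a query that ends with DISTRIBUTE BY user SORT BY ts.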

 

Ques: 8). What happens when you run a query in Hive?

Answer:

The query planner examines the query and converts it into a DAG (Directed Acyclic Graph) of Hadoop MapReduce jobs.

The jobs are submitted to the Hadoop cluster in the order that the DAG dictates.

Only mappers are used for simple queries. The InputFormat is in charge of splitting the input and reading data from HDFS. After that, the data is passed to a layer called the SerDe (Serializer/Deserializer). In this case, the deserializer part of the SerDe converts the data from a byte stream into a structured format.

For aggregate queries, the MapReduce jobs also include reducers. In this case, the serializer part of the SerDe converts structured data into a byte stream, which is handed over to the OutputFormat, which writes it to HDFS.

 

Ques: 9). What is the importance of the STREAMTABLE hint?

Answer:

When you need information from several tables, joins are useful, but when one table has 1.5 billion or more records and you want to join it to a master table, the order of the joined tables is crucial.

Consider the following scenario: 

select foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a; 

Hive streams the right-most table (bar) and buffers the other tables (foo) in memory before executing map-side/reduce-side joins. As a result, if the buffered table holds 1.5 billion or more records, your join query will fail, since that many records will almost certainly exhaust the Java heap space.

To overcome this limitation, and to free the user from having to remember the order of the joined tables based on their size, Hive provides the hint /*+ STREAMTABLE(foo) */, which tells the Hive analyzer to stream table foo.

select /*+ STREAMTABLE(foo) */ foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a;

This way, the user no longer needs to remember the order of the joined tables.

 

Ques: 10). When is it appropriate to use SORT BY instead of ORDER BY?

Answer:

When working with huge volumes of data in Apache Hive, we use SORT BY instead of ORDER BY when a total ordering of the result is not required. SORT BY can use multiple reducers, each of which sorts only its own share of the data, so the output is sorted within each reducer but not globally. This cuts down the time it takes to complete the task. ORDER BY, on the other hand, guarantees a total order by sending all rows through a single reducer, which means the process takes much longer on large data sets.
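The difference can be shown with a short Python simulation. This is a sketch, not Hive code: the two inner lists stand for the rows that happen to reach two reducers, and the function names are invented for illustration.

```python
def sort_by(reducer_inputs):
    """SORT BY: every reducer sorts its own rows; outputs stay separate."""
    return [sorted(rows) for rows in reducer_inputs]

def order_by(reducer_inputs):
    """ORDER BY: all rows pass through one reducer, giving a total order."""
    all_rows = [row for rows in reducer_inputs for row in rows]
    return sorted(all_rows)

reducer_inputs = [[5, 1, 9], [4, 8, 2]]

per_reducer = sort_by(reducer_inputs)
total = order_by(reducer_inputs)

# Concatenating the per-reducer outputs does not give a globally sorted list.
concatenated = [row for rows in per_reducer for row in rows]
```

Each list in per_reducer is sorted, but reading the reducer outputs one after another is not, which is exactly the guarantee SORT BY gives up in exchange for parallelism.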

 

Ques: 11). What is the purpose of Hive's Partitioning function?

Answer:

Partitioning lets users divide the data in a Hive table by the values of one or more partition columns. As a result, a query that filters on a partition column scans only the relevant partitions rather than the complete data set.

Consider the following scenario: Assume we have transaction log data from a business website for years such as 2018, 2019, 2020, and so on. So, in this case, you can utilise the partition key to find data for a specified year, say 2019, which will reduce data scanning by removing 2018 and 2020.
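As a rough illustration of the pruning in this scenario, the Python sketch below models each partition as a separate entry keyed by a directory-style name such as year=2019. The data, the names, and the scan helper are invented; the sketch only shows that a filter on the partition key lets the reader skip whole partitions.

```python
# Each partition is stored separately, keyed by its partition directory name.
partitions = {
    "year=2018": [{"txn": 1, "amount": 10}, {"txn": 2, "amount": 25}],
    "year=2019": [{"txn": 3, "amount": 40}],
    "year=2020": [{"txn": 4, "amount": 5}, {"txn": 5, "amount": 15}],
}

def scan(partitions, year=None):
    """Return matching rows plus the list of partitions that were read."""
    scanned, rows = [], []
    for path, part_rows in partitions.items():
        if year is not None and path != f"year={year}":
            continue  # pruned: this partition is never opened
        scanned.append(path)
        rows.extend(part_rows)
    return rows, scanned

rows_2019, scanned_2019 = scan(partitions, year=2019)
all_rows, scanned_all = scan(partitions)
```

With the filter, only the 2019 partition is read; without it, every partition has to be scanned.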

 

Ques: 12). What is dynamic partitioning and how does it work?

Answer:

In dynamic partitioning, the values of the partition columns are determined at runtime, i.e. they become known only when the data is loaded into the Hive table. A common use of dynamic partitioning is:

To move data from a non-partitioned table to a partitioned table, which reduces latency and improves sampling.

 

Ques: 13). In Hive, what's the difference between dynamic and static partitioning?

Answer:

Hive partitioning is highly beneficial for pruning data during queries in order to reduce query times.

Partitions are created when data is inserted into a table. Which kind of partition you need depends on how the data is loaded. When loading files (especially large files) into Hive tables, static partitions are usually preferred because they save load time compared with dynamic partitions. You "statically" add a partition to the table and then move the file into that partition.

Because the files are large, they are typically generated on HDFS. You can get the partition column value from the filename, the date, and so on, without reading the whole file. With dynamic partitioning, the whole file is read, i.e. every row of data is read, and a MapReduce job partitions the data into the target tables based on specified fields in the file.

Dynamic partitions are typically handy when doing an ETL operation in your data pipeline. For example, suppose you load a large file into Table X with a move. Then you run an INSERT query into Table Y and partition the data based on Table X fields such as day and country. You might then run a further ETL step that takes the data in a country partition of Table Y and writes it to a Table Z that is partitioned by city for that specific country alone, and so on.

Thus depending on your end table or requirements for data and in what form data is produced at source you may choose static or dynamic partition.
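The contrast can be sketched in Python. This is a simulation under assumed data: the function names and rows are hypothetical, and the key=value path format follows Hive's usual partition directory naming. With a static partition the caller names the target partition up front and no row is inspected; with dynamic partitioning every row is read and routed by its own column values.

```python
from collections import defaultdict

def static_load(table, partition_path, file_rows):
    """Static: the partition is named by the caller; rows are not inspected."""
    table.setdefault(partition_path, []).extend(file_rows)

def dynamic_insert(table, rows, partition_cols):
    """Dynamic: each row is read and routed by its partition column values."""
    for row in rows:
        path = "/".join(f"{col}={row[col]}" for col in partition_cols)
        data = {k: v for k, v in row.items() if k not in partition_cols}
        table[path].append(data)

table_x = {}
static_load(table_x, "day=2021-11-17", [{"user": "u1"}, {"user": "u2"}])

table_y = defaultdict(list)
dynamic_insert(
    table_y,
    [
        {"user": "u1", "day": "2021-11-17", "country": "IN"},
        {"user": "u2", "day": "2021-11-17", "country": "US"},
        {"user": "u3", "day": "2021-11-18", "country": "IN"},
    ],
    partition_cols=["day", "country"],
)
```

The dynamic path has to touch every row, which is why it costs more at load time but needs no prior knowledge of which partitions the data contains.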

 

Ques: 14). What is ObjectInspector in Hive?

Answer:

The ObjectInspector is a feature that allows Hive to analyze the internal structure of a row object and of its individual columns. It also provides a uniform way to access complex objects that can be stored in memory in various formats, including:

  • A standard Java object
  • An instance of the Java class
  • A lazily initialized object

The ObjectInspector lets the users know the structure of an object and also helps in accessing the internal fields of an object.

 

Ques: 15). How does Impala outperform Hive in terms of query response time?

Answer:

Impala should be thought of as "SQL on HDFS," whereas Hive is more "SQL on Hadoop."

Impala, in other words, does not use MapReduce at all. It simply runs daemons on all of the nodes that store data in HDFS, and these daemons return data rapidly without having to run a full Map/Reduce job.

The rationale for this is that running a Map/Reduce job has some overhead, so short-circuiting Map/Reduce completely can result in a significant reduction in runtime.

That said, Impala is not a replacement for Hive; each is useful in different situations. Unlike Hive, Impala does not provide fault tolerance for queries, so if a problem occurs while your query is running, the query fails and has to be rerun. Hive is the better choice for ETL processes where a single job failure would be costly, while Impala can be great for small ad-hoc queries, such as for data scientists or business analysts who just want to look at and study some data without having to develop substantial jobs.

 

Ques: 16). Explain the different components used in the Hive Query processor?

Answer:

The Hive query processor consists of the following components:

  • Metadata Layer (ql/metadata)
  • Parse and Semantic Analysis (ql/parse)
  • Map/Reduce Execution Engine (ql/exec)
  • Sessions (ql/session)
  • Type Interfaces (ql/typeinfo)
  • Tools (ql/tools)
  • Hive Function Framework (ql/udf)
  • Plan Components (ql/plan)
  • Optimizer (ql/optimizer)

 

Ques: 17). What is the difference between Hadoop Buffering and Hadoop Streaming?

Answer:

Hadoop Streaming means using custom Python or shell scripts to implement your map-reduce logic. (In Hive, the TRANSFORM keyword does this, for example.)

In this context, Hadoop buffering refers to the phase in a map-reduce job of a Hive query with a join when records are read into the reducers after being sorted and grouped by the mappers. This is why you should order the join clauses in a Hive query so that the largest tables come last; it helps Hive implement joins more efficiently.

 

Ques: 18). How will the work be optimised by the map-side join?

Answer:

Suppose we have two tables, one of which is small. Before the original join MapReduce task, a local MapReduce task is generated that reads the small table's data from HDFS and puts it into an in-memory hash table. After reading the data, it serialises the in-memory hash table into a hash table file.

In the next stage, while the original join MapReduce task is running, the hash table file is moved to the Hadoop distributed cache, which copies it to each mapper's local disc. All mappers can then load this persisted hash table file back into memory and perform the join as before.

After this optimisation, the small table needs to be read only once. In addition, if many mappers run on the same machine, the distributed cache only needs to send a single copy of the hash table file to that machine.

Advantages of using Map-side join:

Using a map-side join removes the cost of sorting and merging data in the shuffle and reduce stages. It also improves task performance by reducing the time it takes to complete the job.

Disadvantages of Map-side join:

It is only suitable for use when one of the tables on which the map-side join operation is performed is small enough to fit into memory. As a result, performing a map-side join on tables with a lot of data in each of them isn't a good idea.
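The core of a map-side join is a plain hash join, which the following Python sketch illustrates. The table contents and function names are made up, and the sketch leaves out the distributed cache entirely; it only shows the small table being loaded into an in-memory hash table once and the large table being streamed past it, with no sort or shuffle step.

```python
def build_hash_table(small_rows, key):
    """Read the small table once and index its rows by the join key."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key], []).append(row)
    return table

def map_side_join(big_rows, hash_table, key):
    """Stream the big table; each mapper probes the in-memory hash table."""
    for big in big_rows:
        for small in hash_table.get(big[key], []):
            yield {**small, **big}

customers = [  # small table, must fit in memory
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Raj"},
]
orders = [  # large table, streamed row by row
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 2},
    {"order_id": 12, "cust_id": 1},
    {"order_id": 13, "cust_id": 9},  # no matching customer: dropped
]

hash_table = build_hash_table(customers, "cust_id")
joined = list(map_side_join(orders, hash_table, "cust_id"))
```

The memory cost is proportional to the small table only, which is why the technique stops being viable once neither table fits in memory.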

 

Ques: 19). What types of user defined functions exist in Hive?

Answer:

Hive has three types of user defined functions: UDFs, UDAFs, and UDTFs.

A UDF operates on a single row and produces a single row as its output. Most functions, such as mathematical functions and string functions, are of this type.

A UDF must satisfy the following two properties:

  • A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
  • A UDF must implement at least one evaluate() method.

 

A UDAF works on multiple input rows and creates a single output row. Aggregate functions include such functions as COUNT and MAX.

A UDAF must satisfy the following two properties:

  • A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF.
  • An evaluator must implement five methods:
    • init()
    • iterate()
    • terminatePartial()
    • merge()
    • terminate()

A UDTF operates on a single row and produces multiple rows (a table) as output.

  • A UDTF must be a subclass of org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.
  • A custom UDTF can be created by extending the GenericUDTF abstract class and then implementing the initialize, process, and possibly close methods.
  • The initialize method is called by Hive to notify the UDTF of the argument types to expect.
  • The UDTF must then return an object inspector corresponding to the row objects that the UDTF will generate.
  • Once initialize() has been called, Hive will give rows to the UDTF using the process() method.
  • While in process(), the UDTF can produce and forward rows to other operators by calling forward().
  • Lastly, Hive will call the close() method when all the rows have been passed to the UDTF.
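The five evaluator methods listed for a UDAF are easier to follow with a worked example. The Python class below is only an analogue of that lifecycle, not the Hive Java API: the class name, the snake_case method names, and the run_partition helper are invented for illustration. It computes MAX by aggregating each partition separately and then merging the partial results.

```python
class MaxEvaluator:
    """Python analogue of a UDAF evaluator that computes MAX."""

    def init(self):
        # Reset the aggregation state.
        self.current = None

    def iterate(self, value):
        # Consume one input row.
        if value is not None and (self.current is None or value > self.current):
            self.current = value

    def terminate_partial(self):
        # Hand over this partition's partial result.
        return self.current

    def merge(self, partial):
        # Combine another evaluator's partial result into this one.
        self.iterate(partial)

    def terminate(self):
        # Produce the final aggregate.
        return self.current


def run_partition(values):
    """Aggregate one partition of input rows, as a single mapper might."""
    evaluator = MaxEvaluator()
    evaluator.init()
    for value in values:
        evaluator.iterate(value)
    return evaluator.terminate_partial()


partials = [run_partition([3, 7, 1]), run_partition([]), run_partition([5, 9])]

final = MaxEvaluator()
final.init()
for partial in partials:
    final.merge(partial)
result = final.terminate()
```

An empty partition contributes a None partial, which merge() ignores, so the final result still reflects only real input rows.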

 

Ques: 20). Is the HIVE LIMIT clause truly random?

Answer:

Although the manual claims that it returns rows at random, this is not the case. Without any WHERE or ORDER BY clause, LIMIT returns rows in whatever order they happen to be read from storage. That is not truly random (or randomly picked); it only means that the order in which the rows are returned can't be predicted.

As soon as you add ORDER BY x DESC LIMIT 5, it returns the 5 rows with the largest values of x. To get rows returned at random, you'd have to use something like ORDER BY rand() LIMIT 1.

However, sorting the whole table by rand() can be slow on a large table. An alternative is to do a min/max to get the range of IDs in the table, generate a random number between them, and then select the matching records (here, just one), which is usually faster than letting the database do the work, especially on a huge dataset.
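The min/max approach can be sketched in Python as follows. The ID list and the helper name are hypothetical, and the sketch assumes the IDs are available in sorted order. Because IDs may have gaps, it picks the first existing ID at or above the random target, which slightly favours IDs that follow a gap.

```python
import bisect
import random

def random_row_id(sorted_ids, rng):
    """Pick a row ID via a random number between the min and max ID."""
    lowest, highest = sorted_ids[0], sorted_ids[-1]
    target = rng.randint(lowest, highest)
    # The target may fall into a gap; take the next ID that actually exists.
    index = bisect.bisect_left(sorted_ids, target)
    return sorted_ids[index]

ids = [3, 4, 9, 15, 16, 42]  # hypothetical IDs with gaps
rng = random.Random(0)       # seeded only so the run is repeatable
picks = [random_row_id(ids, rng) for _ in range(100)]
```

This avoids sorting the whole table, at the cost of a selection that is not perfectly uniform when the IDs are unevenly spaced.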



Top 20 Apache Ambari interview Questions & Answers

  

Ques: 1). Describe Apache Ambari's main characteristics.

Answer:

Apache Ambari is an Apache project created to make Hadoop clusters easier to manage. Its main characteristics are:

  • Provisioning is simple.
  • Project management made simple
  • Monitoring of Hadoop clusters
  • Availability of a user-friendly interface
  • Hadoop management web UI
  • RESTful API support


Ques: 2). Why do you believe Apache Ambari has a bright future?

Answer:

With the growing need for big data technologies like Hadoop, we've witnessed a surge in data analysis, resulting in very large clusters. Companies are turning to technologies like Apache Ambari for better cluster management, increased operational efficiency, and increased visibility. Furthermore, Hortonworks, a major technology vendor, has been working on Ambari to make it more scalable. As a result, learning Hadoop as well as technologies like Apache Ambari is advantageous.


Ques: 3). What are the core benefits for Hadoop users by using Apache Ambari?

Answer: 

Apache Ambari is a great help for people who use Hadoop in their day-to-day work. With Ambari, Hadoop users get these core benefits:

1. The installation process is simplified
2. Configuration and overall management are simplified
3. It has a centralized security setup process
4. It gives full visibility into cluster health
5. It is highly extensible and can be customized if needed.


Ques: 4). What Are The Checks That Should Be Done Before Deploying A Hadoop Instance?

Answer:

Before actually deploying the Hadoop instance, the following checklist should be completed:

  • Check for existing installations
  • Set up passwordless SSH
  • Enable NTP on the cluster
  • Check for DNS
  • Disable SELinux
  • Disable iptables

Ques: 5). As a Hadoop user or system administrator, why should you choose Apache Ambari?

Answer:

Using Apache Ambari can provide a Hadoop user with a number of advantages.

A system administrator can use Ambari to:

Install Hadoop across any number of hosts using the step-by-step wizard that Ambari provides, while Ambari handles the configuration of the Hadoop installation.

Centrally administer Hadoop services across the cluster.

Efficiently monitor the state and health of a Hadoop cluster using the Ambari metrics system. Furthermore, the Ambari alert framework sends out timely notifications for system problems such as disc space issues or node status changes.

 

Ques: 6). Can you explain Apache Ambari architecture?

Answer:

Apache Ambari consists of the following major components:

  • Ambari Server
  • Ambari Agent
  • Ambari Web

Apache Ambari Architecture

All metadata is handled by the Ambari server, which includes a Postgres database instance. The Ambari agent is installed on each computer in the cluster, and the Ambari server manages each host through it.

An Ambari agent runs on each host and sends heartbeats from the node to the Ambari server, along with numerous operational metrics, so that the server can determine the node's health.

Ambari Web UI is a client-side JavaScript application that performs cluster operations by regularly accessing the Ambari RESTful API. Furthermore, using the RESTful API, it facilitates asynchronous communication between the application and the server.
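Any HTTP client can talk to the same RESTful API that Ambari Web uses. The Python sketch below only builds a request and does not send it. The host name and credentials are placeholders, and it assumes the server listens on port 8080 and exposes the cluster list under /api/v1/clusters with HTTP basic authentication; check these against your own installation.

```python
import base64
import urllib.request

def build_clusters_request(host, user, password, port=8080):
    """Build (but do not send) a GET request for the cluster list."""
    url = f"http://{host}:{port}/api/v1/clusters"
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    headers = {"Authorization": f"Basic {token}"}
    return urllib.request.Request(url, headers=headers, method="GET")

# Placeholder host and credentials, for illustration only.
request = build_clusters_request("ambari.example.com", "admin", "admin")

# To actually call the server, pass the request to urllib.request.urlopen().
```

Keeping request construction separate from sending makes the sketch easy to inspect without a running Ambari server.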

 

Ques: 7). Apache Ambari supports how many layers of Hadoop components, and what are they?

Answer: 

Apache Ambari supports three tiers of Hadoop components, which are as follows:

1. Hadoop core components

  • Hadoop Distributed File System (HDFS)
  • MapReduce

2. Essential Hadoop components

  • Apache Pig
  • Apache Hive
  • Apache HCatalog
  • WebHCat
  • Apache HBase
  • Apache ZooKeeper

3. Hadoop support components

  • Apache Oozie
  • Apache Sqoop
  • Ganglia
  • Nagios

 

Ques: 8). What different sorts of Ambari repositories are there?

Answer: 

Ambari Repositories are divided into four categories, as below:

  1. Ambari: Ambari server, monitoring software packages, and Ambari agent are all stored in this repository.
  2. HDP-UTILS: The Ambari and HDP utility packages are stored in this repository.
  3. HDP: Hadoop Stack packages are stored in this repository.
  4. EPEL (Extra Packages for Enterprise Linux): This repository contains an additional set of packages for Enterprise Linux.

 

Ques: 9). How can I manually set up a local repository?

Answer:

When there is no active internet connection available, this technique is used. Please follow the instructions below to create a local repository:

1. First and foremost, create an Apache httpd host.
2. Download a Tarball copy of each repository's entire contents.
3. After it has been downloaded, the contents must be extracted.

 

Ques: 10). What is a local repository, and when are you going to utilise one?

Answer:

A local repository is a hosted place for Ambari software packages in the local environment. When the enterprise clusters have no or limited outbound Internet access, this is the method of choice.

 

Ques: 11). What are the benefits of setting up a local repository?

Answer: 

First and foremost, by setting up a local repository you can access Ambari software packages without Internet access. Along with that, you get benefits such as:

Enhanced governance with better installation performance

Support for routine post-installation cluster operations such as service start and restart

 

Ques: 12). What are the new additions in Ambari 2.6 versions?

Answer:

Ambari 2.6.2 added the following features:

  • Protection of Zeppelin Notebook SSL credentials
  • Setting appropriate HTTP headers to use Cloud Object Stores with HDP

Ambari 2.6.1 added the following feature:

  • Conditional installation of LZO packages through Ambari

Ambari 2.6.0 added the following features:

  • Distributed mode of the Ambari Metrics System (AMS) along with multiple Collectors
  • Host Recovery improvements for the restart
  • Moving masters with minimum impact and scale testing
  • Improvement in Data Archival & Purging in Ambari Infra

 

Ques: 13). List Out The Commands That Are Used To Start, Check The Progress And Stop The Ambari Server?

Answer :

The following are the commands that are used to do the following activities:

To start the Ambari server

ambari-server start

To check the Ambari server processes

ps -ef | grep Ambari

To stop the Ambari server

ambari-server stop

 

Ques: 14). Which tasks can you perform to manage hosts from the Ambari Hosts tab?

Answer: 

Using the Hosts tab, we can perform the following tasks:

  • Analysing Host Status
  • Searching the Hosts Page
  • Performing Host related Actions
  • Managing Host Components
  • Decommissioning a Master node or Slave node
  • Deleting a Component
  • Setting up Maintenance Mode
  • Adding or removing Hosts to a Cluster
  • Establishing Rack Awareness

 

Ques: 15). Which tasks can you perform to manage services from the Ambari Services tab?

Answer: 

Using the Services tab, we can perform the following tasks:

  • Start and Stop of All Services
  • Display of Service Operating Summary
  • Adding a Service
  • Configuration Settings change
  • Performing Service Actions
  • Rolling Restarts
  • Background Operations monitoring
  • Service removal
  • Auditing operations
  • Using Quick Links
  • YARN Capacity Scheduler refresh
  • HDFS management
  • Atlas management in a Storm Environment

 

Ques: 16). Is there a relationship between the amount of free RAM and disc space required and the number of HDP cluster nodes?

Answer: 

Yes. The amount of RAM and disc space required depends on the number of nodes in your cluster. Typically, a single-node cluster needs about 1 GB of memory and 10 GB of disc space, whereas a 100-node cluster needs about 4 GB of memory and 100 GB of disc space. For exact figures, check the documentation for your specific version.

 

Ques: 17). Which tasks can you perform to manage services using the Ambari Services tab?

Answer: 

Using the Services tab, we can perform the following tasks:

  • Start and Stop of All Services
  • Display of Service Operating Summary
  • Adding a Service
  • Configuration Settings change
  • Performing Service Actions
  • Rolling Restarts
  • Background Operations monitoring
  • Service removal
  • Auditing operations
  • Using Quick Links
  • YARN Capacity Scheduler refresh
  • HDFS management
  • Atlas management in a Storm Environment

 

Ques: 18). What is the best method for installing the Ambari agent on all 1000 hosts in the HDP cluster?

Answer: 

Because the cluster contains 1000 nodes, we should not manually install the Ambari agent on each node. Instead, we should set up a password-less ssh connection between the Ambari host and all of the cluster's nodes. To remotely access and install the Ambari Agent, Ambari Server hosts employ SSH public key authentication.

 

Ques: 19). What can I do if I have admin capabilities in Ambari?

Answer: 

If you are an Ambari Admin, you can create a cluster, manage the users in that cluster, and create groups. All of these permissions are granted to the default admin user. As an Ambari administrator, you can also grant the same or different permissions to another user.

 

Ques: 20).  How is recovery achieved in Ambari?

Answer:

Recovery happens in Ambari in the following ways:

Based on actions

After a restart, the Ambari master checks for pending actions and reschedules them, since every action is persisted. The master also rebuilds the state machines after a restart, as the cluster state is persisted in the database. There is a race condition in which an action completes but the master crashes before recording its completion. For this reason, actions should be idempotent, which is a special consideration. The master restarts any action that has not been marked as complete or that has failed in the database. These persisted actions can be seen in the redo logs.

Based on the desired state

The master persists the desired state of the cluster and, after a restart, attempts to bring the live state of the cluster in line with that desired state.
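The action-based recovery in this answer can be illustrated with a small Python sketch. It is not Ambari code: the log format, the status values, and the helper names are all assumptions made for illustration. After a simulated restart, every persisted action that is not marked as completed is run again, which is safe only because the action itself is idempotent.

```python
def recover(action_log, execute):
    """Re-run every persisted action that is not marked COMPLETED."""
    restarted = []
    for action in action_log:
        if action["status"] != "COMPLETED":
            execute(action)
            action["status"] = "COMPLETED"
            restarted.append(action["id"])
    return restarted

# Cluster state that the actions modify; sets make "start" idempotent.
running_services = set()

def start_service(action):
    """Idempotent action: starting an already running service changes nothing."""
    running_services.add(action["service"])

# Hypothetical persisted log. Action 2 actually finished before the crash,
# but its completion was never recorded (the race condition in the text).
running_services.add("HDFS")
running_services.add("YARN")
action_log = [
    {"id": 1, "service": "HDFS", "status": "COMPLETED"},
    {"id": 2, "service": "YARN", "status": "IN_PROGRESS"},
    {"id": 3, "service": "HIVE", "status": "FAILED"},
]

restarted = recover(action_log, start_service)
```

Re-running action 2 does no harm because the service is simply already in the running set, which is the reason the text insists on idempotent actions.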