
January 04, 2022

Top 20 Apache Cassandra Interview Questions and Answers

  

                Apache Cassandra is one of the most popular NoSQL distributed database management systems. Cassandra is an open-source database designed to store and manage enormous amounts of data without failure. Written in Java, it has flexible schemas, scales well for Big Data workloads, and was originally built at Facebook. There is no single point of failure in Apache Cassandra. Cassandra combines a column-oriented model with a key–value store, and it is one of the most popular NoSQL databases. The outermost container in a Cassandra application is the keyspace, which in turn holds the tables (column families).


Ques. 1): What is the purpose of Cassandra and why should you utilise it?

Answer:

Cassandra was created to handle large data workloads across many nodes with no single point of failure. A number of factors drive its adoption:

  • It's fault-tolerant and reliable.
  • It scales from gigabytes to petabytes.
  • It's a column-oriented database.
  • There is no single point of failure.
  • There is no need for a separate caching layer.
  • Flexible schema design.
  • It offers flexible data storage, simple data distribution, and fast write speeds.
  • Atomicity, isolation, and durability are provided at the row level, with tunable consistency (Cassandra is not a fully ACID database).
  • Cloud and multi-data centre capabilities
  • Compression of data


Ques. 2): What are Cassandra's applications?

Answer:

When it comes to app development and data management, Cassandra has become the go-to solution for many businesses, and even fresh start-ups choose it because it is easy to operate.

Cassandra is a great fit for collecting data from a variety of sources at a rapid rate. It works well in Internet of Things applications, and it is also used in product and retail apps, messaging, social media analytics, and recommendation engines.


Ques. 3): What are the advantages of utilising Cassandra?

Answer:

  • Apache Cassandra provides near real-time performance, making the work of developers, administrators, data analysts, and software engineers much easier.
  • Cassandra is built on a peer-to-peer architecture rather than a master–slave design, so there is no single point of failure.
  • It offers great flexibility: nodes can be added to any Cassandra cluster in any data centre, and any client can send a request to any server.
  • Cassandra offers elastic scalability and can be scaled up or down as needed. This NoSQL database does not need to be restarted while scaling, and it sustains a high throughput for read and write operations.
  • Cassandra is also known for its powerful data replication across nodes, which lets users store data in multiple locations and recover it from another location if one node fails. Users can configure the number of replicas they want.
  • It performs admirably on large datasets, making it the NoSQL database of choice for most businesses.
  • It uses a column-oriented structure, which speeds up and simplifies slicing; with a column-based data model, data access and retrieval are more efficient.
  • Furthermore, Apache Cassandra has a schema-free/schema-optional data model, so there is no need to define all of the columns that your application requires up front.


Ques. 4): In Cassandra, explain the idea of adjustable consistency.

Answer:

Cassandra's tunable consistency is a standout feature that makes it popular among developers, analysts, and big data architects. Consistency means that all replicas hold up-to-date and synchronised data rows. Cassandra's tunable consistency allows users to choose the consistency level that best suits their needs. It supports two kinds of consistency: eventual consistency and strong consistency.

The former ensures consistency when no new updates are made to a data item, i.e., all accesses eventually return the most recently modified value. Replica convergence is a term used to describe systems that achieve eventual consistency.

For strong consistency, Cassandra supports the following condition:

R + W > N where,

N – Number of replicas

W – Number of nodes that need to agree for a successful write

R – Number of nodes that need to agree for a successful read
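
As an illustration with hypothetical numbers: with N = 3 replicas, writing at QUORUM (W = 2) and reading at QUORUM (R = 2) gives 2 + 2 = 4 > 3, so every read overlaps the most recent write and consistency is strong. Writing and reading at ONE (W = 1, R = 1) gives 1 + 1 = 2, which is not greater than 3, so a read may miss the latest write and only eventual consistency is guaranteed.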


Ques. 5): What is Cassandra's data storage method?

Answer:

All data is stored as bytes.

When you specify a validator, Cassandra ensures that the bytes are encoded correctly.

A comparator then orders the columns according to the encoding's specific ordering.

Composites are simply byte arrays with a specific encoding: each component stores a two-byte length, the byte-encoded component, and a termination bit.


Ques. 6): What is the definition of memtable?

Answer:

A memtable is a place where data is written and temporarily stored: after data has been appended to the commit log, it is written to the memtable.

The memtable is part of Cassandra's storage engine. Each column family has its own memtable, and data in a memtable is organised by key and retrieved by key. When a memtable is full, its contents are flushed to disc as an SSTable and the memtable is cleared.


Ques. 7): Explain the Bloom Filter concept.

Answer:

A Bloom filter is an off-heap (off the Java heap, in native memory) data structure associated with an SSTable. It is checked to see whether the SSTable may contain the requested data before any disc I/O is performed.


Ques. 8): What are the functions of the shell commands "Capture" and "Consistency"?

Answer:

Cassandra has a number of Cqlsh shell commands. The command "Capture" saves the result of a command to a file, whereas the command "Consistency" shows the current consistency level or sets a new one.


Ques. 9): What is the purpose of the read repair request?

Answer:

When the coordinator node sends a read request, it checks whether any replicas hold outdated data. If so, a read repair is issued in the background to replace the stale data with the latest version. Read repair keeps data current and ensures that the requested row is consistent across all replicas.

 

Ques. 10): How does Cassandra write?

Answer:

Cassandra executes a write in two steps: it first appends the write to an on-disc commit log and then applies it to an in-memory structure called the memtable. Once both steps succeed, the write is acknowledged. Memtables are later flushed to disc as SSTables (sorted string tables). This write path is what makes Cassandra especially efficient at writes.
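
A minimal sketch of issuing a write from Java, assuming the DataStax Java driver 4.x is on the classpath, a node is running locally on the default port, and a hypothetical demo.users table already exists:

import com.datastax.oss.driver.api.core.CqlSession;

public class WriteExample {
    public static void main(String[] args) {
        // With no explicit contact point, the driver connects to 127.0.0.1:9042
        try (CqlSession session = CqlSession.builder().build()) {
            // The write is appended to the commit log and applied to the memtable
            // on the replica nodes, and later flushed to SSTables on disc
            session.execute("INSERT INTO demo.users (id, name) VALUES (uuid(), 'alice')");
        }
    }
}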

 

Ques. 11): What are the best Cassandra monitor tools?

Answer:

Despite the fact that Cassandra has built-in tolerance mechanisms, it still needs to be monitored for optimal outcomes. Cassandra utilises the following tools to keep track of its databases:

  • SolarWinds Server & Application Monitor
  • Instana
  • Instaclustr
  • AppDynamics
  • Dynatrace
  • ManageEngine Applications Manager

 

Ques. 12): What is Cassandra- CQL collections?

Answer:

In Cassandra, multiple values can be stored in a single column using CQL collections. CQL collections can be used in the following ways:

List: used when the order of elements needs to be maintained and a value may be stored more than once.

SET: used for a group of unique elements, returned in sorted order.

MAP: a data type used to store key–value pairs of elements.
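
A minimal sketch of the three collection types in a table definition, executed here through the DataStax Java driver (the demo keyspace and profiles table are hypothetical names):

import com.datastax.oss.driver.api.core.CqlSession;

public class CollectionsExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE TABLE IF NOT EXISTS demo.profiles ("
                    + " user_id uuid PRIMARY KEY,"
                    + " emails list<text>,"          // ordered, a value may repeat
                    + " nicknames set<text>,"        // unique values, returned sorted
                    + " settings map<text, text>)"); // key-value pairs
        }
    }
}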

 

Ques. 13): What is Super Column in Cassandra?

Answer:

A super column in Cassandra is a special column that holds a collection of related columns: it is a key–value pair whose value is a map of columns. It is a sorted array of columns, and it sits in the hierarchy keyspace > column family > super column > column, a structure similar to JSON.

Like row keys, super column data entries carry no independent value of their own; they are used to group other columns. Note that super column keys appearing in different rows do not need to match.

 

Ques. 14): Describe the CAP Theorem.

Answer:

With a strong need to scale systems when new resources are required, the CAP theorem is central to any scaling strategy for distributed systems. The Consistency, Availability, and Partition tolerance (CAP) theorem states that a distributed system such as Cassandra can only guarantee two of these three properties at the same time.

One of them has to be sacrificed. Consistency guarantees that a client sees the most recent write; availability guarantees a sensible response within a reasonable time; and partition tolerance guarantees that the system keeps operating even when network partitions occur. Because network partitions cannot be avoided in practice, the realistic choices are AP or CP.

 

Ques. 15): What is the difference between Column and Super Column?

Answer:

Both elements work on the principle of tuples having name and value. However, the former’s value is a string, while the value of the latter is a map of columns with different data types.

Unlike Columns, Super Columns do not contain the third component of timestamp.

 

Ques. 16): What exactly is a Column Family?

Answer:

A column family, as the name implies, is a structure that can hold an unlimited number of rows. Each column is referred to by a key–value pair, where the key is the column name and the value is the column data. It is comparable to a HashMap in Java or a dictionary in Python. Remember that the columns in a row are not confined to a fixed list. The column family is also extremely flexible: one row may have 100 columns while another has only two.
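
A rough Java analogy of that hashmap view (illustrative only, not how Cassandra stores data internally): the outer key plays the role of the row key and the inner map holds that row's columns.

import java.util.HashMap;
import java.util.Map;

public class ColumnFamilyAnalogy {
    public static void main(String[] args) {
        // row key -> (column name -> column value)
        Map<String, Map<String, String>> columnFamily = new HashMap<>();
        // Rows are not confined to the same set of columns
        columnFamily.put("user-1", Map.of("name", "Alice", "email", "alice@example.com", "city", "Pune"));
        columnFamily.put("user-2", Map.of("name", "Bob"));
        System.out.println(columnFamily.get("user-1").get("email"));
    }
}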

 

Ques. 17): Define the management tools in Cassandra.

Answer:

DataStax OpsCenter: a web-based management and monitoring solution for Cassandra clusters from DataStax. The basic edition is free to download, and a commercial enterprise edition is also available.

SPM primarily administers Cassandra metrics and various OS and JVM metrics. Besides Cassandra, SPM also monitors Hadoop, Spark, Solr, Storm, ZooKeeper, and other Big Data platforms. The main features of SPM include correlation of events and metrics, distributed transaction tracing, creating real-time graphs with zooming, anomaly detection, and heartbeat alerting.

 

Ques. 18): In Cassandra, explain the distinctions between a node, a cluster, and a data centre.

Answer:

Cassandra is made up of several parts. A node is a single machine running Cassandra, while a cluster is a collection of nodes that together hold the data. Data centres group the nodes of a cluster and are essential when serving customers in different parts of the world: you can divide a cluster's nodes into several data centres.

 

Ques. 19): What is the purpose of the Bloom Filter in Cassandra?

Answer:

A bloom filter is a space-saving data structure for determining if an element belongs to a set. In other words, it's used to see if an SSTable contains data for a specific row. When executing a KEY LOOKUP in Cassandra, it is utilised to save IO.

 

Ques. 20): What exactly is SSTable? What makes it unique among relational tables?

Answer:

SSTable stands for 'Sorted String Table'. It is a Cassandra data file to which memtables are regularly flushed. SSTables exist for each Cassandra table and are kept on disc. Because they are immutable, SSTables do not allow data items to be inserted or removed once they have been written. For each SSTable, Cassandra also creates a partition index, a partition summary, and a Bloom filter.

 

January 03, 2022

Top 20 Apache Tomcat Interview Questions and Answers

  

       Tomcat is a Java Servlet container and web server developed by the Apache Software Foundation's Jakarta project. Client browsers send requests to a web server, which responds with web pages. Web servers can generate dynamic content based on the user's requests, and Tomcat excels at this because it supports both the Java Servlet and JavaServer Pages (JSP) technologies. Tomcat can be used as a web server for a variety of applications, even when only a free servlet and JSP engine is required. It can run on its own or alongside standard web servers like Apache httpd, with the latter delivering static pages while Tomcat handles dynamic servlet and JSP requests.

    Apache Tomcat is an open-source implementation of the Java Servlet, JavaServer Pages, Java Expression Language, and Java WebSocket specifications. Many firms hire DevOps engineers, Apache Tomcat administrators, Linux Apache Tomcat engineers, and Hadoop developers at varying levels of experience. Apache is the most popular web server, and you must be familiar with it if you plan to work as a middleware/system/web administrator. The Apache HTTP Server is free and open source and runs on both Windows and Linux.


Ques. 1): Who is in charge of Tomcat?

Answer:

The Apache Software Foundation is the correct answer. The Apache Software Foundation is a non-profit organisation that oversees several Open Source projects.

The Apache Software Foundation's Java-based projects are referred to as Jakarta.

Tomcat is an Apache Jakarta project that manages server-side Java (in the form of Servlets and JSPs). Tomcat is the "reference" implementation of the Servlet and JSP specifications, which means that anything that runs in Tomcat should run in any compliant Servlet / JSP container.


Ques. 2): Difference between apache and apache-tomcat server?

Answer: 

Apache: Apache is mostly used to serve static content, but there are numerous add-on modules (some of which are included with Apache) that allow it to modify the content and serve dynamic content written in Perl, PHP, Python, Ruby, and other languages.

Apache is an HTTP server that serves HTTP requests.

Tomcat is a servlet/JSP container developed by Apache. It's written in the Java programming language. Although it can provide static information, its primary function is to host servlets and JSPs.

JSP files (which are comparable to PHP and older ASP files) are converted into Java code (HttpServlet), which is then compiled into .class files and run by the Java virtual machine.

Apache Tomcat is used to deploy your Java Servlets and JSPs. So in your Java project, you can build your WAR (short for Web ARchive) file, and just drop it in the deploy directory in Tomcat.

Although it is possible to get Tomcat to run Perl scripts and the like, you wouldn’t use Tomcat unless most of your content was Java.

Tomcat is a Servlet and JSP Server serving Java technologies


Ques. 3):  What exactly is Coyote?

Answer:

Coyote is a Tomcat Connector component that acts as a web server and supports the HTTP 1.1 protocol. This enables Catalina, which is ostensibly a Java Servlet or JSP container, to additionally serve local files as HTTP documents.

Coyote monitors a specific TCP port for incoming connections to the server and transmits the request to the Tomcat Engine, which processes the request and returns a response to the requesting client.

Coyote is Tomcat's HTTP connector, which offers an interface for browsers to connect to.


Ques. 4): What is a servlet container?

Answer:

A servlet container is a web server component that communicates with Java servlets. The servlet container is in charge of managing servlet lifecycles, mapping URLs to specific servlets, and ensuring that the URL requester has the appropriate access privileges.

Requests to servlets, JavaServer Pages (JSP) files, and other types of files containing server-side code are handled by the servlet container. The Web container generates servlet instances, loads and unloads servlets, creates and manages request and response objects, and handles other servlet-related operations.

The web component contract of the Java EE architecture is implemented by the servlet container, which defines a runtime environment for web components that includes security, concurrency, lifecycle management, transaction, deployment, and other services.


Ques. 5): How Can I Change The Default Home Page Loaded By Tomcat?

Answer :

We can easily override the home page by adding a welcome-file-list element to the application's WEB-INF/web.xml (under $TOMCAT_HOME/webapps/), or by editing the container-wide $TOMCAT_HOME/conf/web.xml.

In $TOMCAT_HOME/conf/web.xml, it may look like this:

<welcome-file-list>
    <welcome-file>index.html</welcome-file>
    <welcome-file>index.htm</welcome-file>
    <welcome-file>index.jsp</welcome-file>
</welcome-file-list>

When the request URI refers to a directory, the default servlet looks for a "welcome file" within that directory in the following order: index.html, index.htm and index.jsp.


Ques. 6): What is the difference between the Apache and Nginx web servers?

Answer:

Both are classified as Web Servers, but there are a few key differences. Nginx is an event-driven web server, whereas Apache is a process-driven web server.

Nginx has a reputation for being faster than Apache.

Whereas Nginx does not support OpenVMS or IBMi, Apache supports a wide range of operating systems.

Nginx is still catching up to Apache in terms of module interoperability with backend application servers.

Nginx is a lightweight web server that is rapidly gaining market share. If you're new to Nginx, you might be interested in reading some of my Nginx articles.


Ques. 7): How Do You Create Multiple Virtual Hosts?

Answer :

If you want Tomcat to accept requests for different hosts, e.g. www.myhostname.com, then you must:

Create ${catalina.home}/www/appBase, ${catalina.home}/www/deploy, and ${catalina.home}/conf/Catalina/www.myhostname.com

Add a Host entry in the server.xml file

Create the following file under conf/Catalina/www.myhostname.com/ROOT.xml

Add any parameters specific to this hosts webapp to this context file

Put your war file in ${catalina.home}/www/deploy

When tomcat starts, it finds the host entry, then looks for any context files and will start any apps with a context.
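
For comparison, here is a minimal sketch of the same idea using the embedded org.apache.catalina.startup.Tomcat API instead of server.xml; the host name and paths are placeholders:

import org.apache.catalina.core.StandardHost;
import org.apache.catalina.startup.Tomcat;

public class MultiHostExample {
    public static void main(String[] args) throws Exception {
        Tomcat tomcat = new Tomcat();
        tomcat.setPort(8080);
        tomcat.getConnector(); // ensure the default connector is created

        // Equivalent of an extra <Host> entry in server.xml
        StandardHost host = new StandardHost();
        host.setName("www.myhostname.com");
        host.setAppBase("www/appBase");
        tomcat.getEngine().addChild(host);

        // Deploy a web application under the new virtual host
        tomcat.addWebapp(host, "", "/path/to/webapp");

        tomcat.start();
        tomcat.getServer().await();
    }
}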

 

Ques. 8): In Apache Tomcat, what is Catalina?

Answer:

Catalina is Tomcat's servlet container. Once Jasper has compiled a JSP into a servlet, Catalina manages its execution. Catalina implements the Java Servlet and JavaServer Pages specifications, and it is the Java engine embedded in Tomcat that provides an efficient environment in which servlets execute.

 

Ques. 9): What exactly do you mean by Tomcat's default port, and can it be used with SSL?

Answer:

Tomcat uses port 8080 as its default port. You can change it by editing the server.xml file in the conf folder of the Tomcat installation directory: change the port attribute of the Connector element (port="8080") to the desired value and restart Tomcat for the change to take effect.

Tomcat can use SSL, but it will require some configuration. You must complete the following tasks:

Generate a keystore

Then add a connector in server.xml

Restart Tomcat

 

Ques. 10): What is a mod_evasive module, and what does it do?

Answer:

mod_evasive is a third-party module that does one simple task very well. It detects when your site is under a Denial of Service (DoS) attack and mitigates the harm that the attack causes. When a single client makes repeated requests in a short period of time, mod_evasive recognises this and refuses further requests from that client. The ban can be short-lived, because it is simply reissued the next time a request is detected from that same host.

 

Ques. 11): Explain Directory Structure Of Tomcat?

Answer :

Directory structure of Tomcat are:

bin - contains startup, shutdown, and other scripts (*.sh for UNIX and *.bat for Windows systems) as well as some JAR files.

conf - server configuration files (including server.xml) and related DTDs. The most important file here is server.xml, the main configuration file for the container.

lib - contains JARs used by the container as well as the Servlet and JSP application programming interfaces (APIs).

logs - log and output files.

webapps - deployed web applications reside here.

work - temporary working directories for web applications, mostly used during JSP compilation, where a JSP is converted to a Java servlet.

temp - directory used by the JVM for temporary files.

 

Ques. 12): Explain How Running Tomcat As A Windows Service Provides Benefits?

Answer :

Running Tomcat as a windows service provides benefits like:

Automatic startup: crucial for environments where you may want to remotely restart a system after maintenance.

Server startup without an active user login: Tomcat often runs on blade servers that may not even have a monitor attached. Windows services can be started without an active user session.

Security: running Tomcat as a Windows service lets you run it under a special system account, which is protected from the rest of the user accounts.

 

Ques. 13): How Do Servlet Life Cycles Work?

Answer:

The life-cycle of a typical Tomcat servlet is as follows:

Through one of its connectors, Tomcat receives a request from a client.

Tomcat maps the request and routes it to the appropriate engine and web application for processing.

Once the request has been forwarded to the proper servlet, Tomcat checks that the servlet class has been loaded. If it hasn't, Tomcat loads the servlet class, and the JVM creates a servlet instance from its bytecode.

Tomcat starts the servlet by invoking its init method. The servlet includes code that can inspect Tomcat configuration files and take appropriate action, as well as declare any resources it might need.

Once the servlet has been initialised, Tomcat can call the servlet's service method to process the request.

Tomcat and the servlet can coordinate or communicate through listener classes during the servlet's lifecycle, which track the servlet for a variety of state changes.

To remove the servlet, Tomcat calls the servlet's destroy method.
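
A minimal servlet sketch showing the lifecycle hooks Tomcat calls (this assumes the javax.servlet API used by Tomcat 9 and earlier; Tomcat 10 and later use the jakarta.servlet packages instead):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LifecycleServlet extends HttpServlet {
    @Override
    public void init() throws ServletException {
        // Called once by Tomcat after the servlet instance is created
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Invoked (via service()) for each incoming GET request
        resp.getWriter().println("hello from LifecycleServlet");
    }

    @Override
    public void destroy() {
        // Called once before Tomcat removes the servlet instance
    }
}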

 

Ques. 14): In Tomcat, what is the difference between a host and a context?

Answer:

In Tomcat, the host is a component. It's a network name association for the server. On the other hand, context is an element that indicates a web application that is running on a certain virtual host. Web applications are built on top of a Web Application Archive (WAR) file or a corresponding directory that contains all of the unpacked content indicated in the servlet description.

 

Ques. 15): What Is The Distinction Between A Webserver And An Application Server?

Answer:

The main distinction between a web server and an application server is that a web server can only execute web applications, such as servlets and JSPs, and has just one container, the Web container, that is used to understand and execute web applications. The application server has the ability to run Enterprise applications, i.e. (servlets, jsps, and EJBs)

It has two containers:

Web container (for interpreting/executing servlets and JSPs)

EJB container (for executing EJBs).

It can also perform operations such as load balancing, transaction demarcation, etc.

 

Ques. 16): Apart from Apache Tomcat, what are the different kinds of Web Servers?

Answer:

There are many web servers as mentioned below:

LiteSpeed Web Server

GWS Web Server

Microsoft IIS Web Server

Nginx Web Server

Jigsaw Web Server

Sun Java System Web Server

Lighttpd Web Server

 

Ques. 17): How to limit upload size?

Answer:

Suppose you have a web application that allows users to upload files such as Word documents, PDFs and so on. How do you limit the size of those uploads?

You can make use of the LimitRequestBody directive to limit upload file size.

<Directory "usr/local/apache2/uploads">

LimitRequestBody 9000

</Directory>

The value assigned to LimitRequestBody tells Apache to accept file uploads from users of up to 9000 bytes. You can adjust the value based on your requirements.

 

Ques. 18): Explain how to use WAR files to deploy a web application.

Answer:

JSPs, servlets, and their associated files are placed under Tomcat's web applications directory in the appropriate subdirectories. You can combine all of the files of an application into a single compressed file with the extension .war. A web application can be run by placing such a WAR file in the webapps directory. When the web server starts up, it extracts the contents of the WAR file and places them in the proper webapps sub-directories.

 

Ques. 19): How can an Apache Service be stopped by its control script?

Answer:

The Apache Service is controlled using a script called the apachectl.

So, to stop the service, we need to run the below-mentioned commands.

# apachectl stop [for Ubuntu-based systems]

# /etc/init.d/httpd stop [for Red Hat-based systems]

 

Ques. 20): What is the purpose of the Listen property in Apache Tomcat?

Answer:

The Listen directive is very important for Apache and for developers.

If the server has several IP addresses, we must explicitly indicate the IP and port in the Listen directive if we want Apache to listen on only one of them.

For example: Listen 10.10.10.20:80

 

 

Top 20 Apache Kafka Interview Questions and Answers

 

Apache Kafka is a free and open-source streaming platform. Kafka began as a messaging queue at LinkedIn, but it has since grown into much more. It's a flexible tool for working with data streams that can be used in a wide range of situations. Because Kafka is a distributed system, it can scale up and down as needed; all that's required is to add new Kafka nodes (servers) to the cluster.

Kafka can process a large volume of data in a short amount of time. It also has low latency, allowing for real-time data processing. Although Apache Kafka is written in Scala and Java, it can be used from a wide range of programming languages.




Ques. 1): What exactly do you mean when you say "confluent kafka"? What are the benefits?

Answer:

Confluent is an Apache Kafka-based data streaming platform that can do more than just publish and subscribe. It can also store and process data within the stream. Confluent Kafka is a more extensive version of Apache Kafka. It improves Kafka's integration capabilities by adding tools for optimising and maintaining Kafka clusters, as well as methods for ensuring the security of the streams. Because of the Confluent Platform, Kafka is simple to set up and use. Confluent's software is available in three flavours:

A free, open-source streaming platform that makes working with real-time data streams a breeze;

A premium cloud-based version with additional administration, operations, and monitoring features; and an enterprise-grade version with the most complete set of administration, operations, and monitoring tools.

Following are the advantages of Confluent Kafka :

  • It features practically all of Kafka's characteristics, as well as a few extras.
  • It greatly simplifies the administrative operations procedures.
  • It relieves data managers of the burden of thinking about data relaying.




Ques. 2): What are some of Kafka's characteristics?

Answer:

The following are some of Kafka's most notable characteristics:-

  • Kafka is a fault-tolerant messaging system with a high throughput.
  • Topics provide a built-in partitioning system in Kafka.
  • Kafka also comes with a replication mechanism.
  • Kafka is a distributed messaging system that can manage massive volumes of data and transfer messages from one sender to another.
  • The messages can also be saved to storage and replicated across the cluster using Kafka.
  • Kafka works with Zookeeper for synchronisation and collaboration with other services.
  • Kafka provides excellent support for Apache Spark.




Ques. 3): What are some of the real-world usages of Apache Kafka?

Answer:

The following are some examples of Apache Kafka's real-world applications:

Message Broker: Because Apache Kafka has a high throughput value, it can handle a large number of similar sorts of messages or data. Apache Kafka can be used as a publish-subscribe messaging system that makes it simple to read and publish data.

Website activity tracking: Apache Kafka can verify that site event data is successfully delivered and received. It can handle the huge volumes of data generated by websites for each page view and user action.

Operational metrics monitoring: Apache Kafka can be used to monitor operational data and metrics connected to particular technologies, such as security logs.

Data logging: Apache Kafka provides data replication between nodes functionality that can be used to restore data on failed nodes. It can also be used to collect data from various logs and make it available to consumers.

Stream Processing with Kafka: Apache Kafka can also handle streaming data, the data that is read from one topic, processed, and then written to another. Users and applications will have access to a new topic containing the processed data.




Ques. 4): What are some of Kafka's disadvantages?

Answer:

The following are some of Kafka's drawbacks:

  • When messages are tweaked, Kafka performance suffers. Kafka works well when the message does not need to be updated.
  • Kafka does not support wildcard topic selection; the exact topic name must be used.
  • When dealing with large messages, brokers and consumers degrade Kafka's performance by compressing and decompressing the messages. This has an effect on Kafka's performance and throughput.
  • Kafka does not support several message paradigms, such as point-to-point queues and request/reply.
  • Kafka lacks a comprehensive set of monitoring tools.




Ques. 5): What are the use cases of Kafka monitoring?

Answer:

The following are some examples of Kafka monitoring use cases:

  • Monitor the use of system resources: track the usage of system resources such as memory, CPU, and disc over time.
  • Monitor threads and JVM usage: Kafka relies on the Java garbage collector to free up memory, so watching GC activity helps keep the Kafka cluster healthy.
  • Keep an eye on broker, controller, and replication statistics so that partition and replica statuses can be adjusted as needed.
  • Identifying which applications are producing excessive demand, and finding performance bottlenecks, helps resolve performance issues quickly.

 

Ques. 6): What is the difference between Kafka and Flume?

Answer:

Flume's main application is ingesting data into Hadoop. It is tightly integrated with Hadoop's monitoring, file formats, file system, and tools such as Morphlines. Flume is the best option when working with non-relational data sources or when streaming large files into Hadoop.

Kafka's main use case is as a distributed publish-subscribe messaging system. Kafka was not created specifically for Hadoop, so using it to gather and analyse data for Hadoop is considerably more involved than using Flume.

Kafka is the right choice when a highly reliable and scalable enterprise messaging system is required, for example to connect systems such as Hadoop to other parts of the stack.

 

Ques. 7): Explain the terms "leader" and "follower."

Answer:

In Kafka, each partition has one server that acts as a Leader and one or more servers that operate as Followers. The Leader is in charge of all read and write requests for the partition, while the Followers are responsible for passively replicating the leader. In the case that the Leader fails, one of the Followers will assume leadership. The server's load is balanced as a result of this.

 

Ques. 8): What are the traditional methods of message transfer? How is Kafka better from them?

Answer:

The classic techniques of message transmission are as follows: -

Message Queuing: -

The message queuing pattern employs a point-to-point approach. A message in the queue is discarded once it has been consumed, similar to how a message in the Post Office Protocol is removed from the server once it has been delivered. These queues allow for asynchronous messaging.

If a network difficulty prevents a message from being delivered, such as when a consumer is unavailable, the message will be queued until it is transmitted. As a result, messages aren't always sent in the same order. Instead, they are distributed on a first-come, first-served basis, which in some cases can improve efficiency.

Publisher - Subscriber Model:-

The publish-subscribe pattern entails publishers producing ("publishing") messages in multiple categories and subscribers consuming published messages from the various categories to which they are subscribed. Unlike point-to-point messaging, a message is only removed once it has been consumed by all category subscribers.

Kafka offers a single consumer abstraction that generalises both of the above: the consumer group. The advantages of adopting Kafka over standard messaging transfer mechanisms are as follows:

Scalable: Data is partitioned and streamlined using a cluster of devices, which increases storage capacity.

Faster: a single Kafka broker can handle megabytes of reads and writes per second, allowing it to serve thousands of clients.

Durability and Fault-Tolerant: The data is kept persistent and tolerant to any hardware failures by copying the data in the clusters.

  

Ques. 9): What is a Replication Tool in Kafka? Explain how to use some of Kafka's replication tools.

Answer:

The Kafka Replication Tool is used to define the replica management process at a high level. Some of the replication tools available are as follows:

Preferred Replica Leader Election Tool: partitions are distributed across many brokers in a cluster, each copy being known as a replica, and the leader is normally the preferred replica. The brokers generally distribute the leader role fairly across the cluster, but failures, planned shutdowns, and other circumstances can create an imbalance over time. This tool can be used to restore the balance in such cases by reassigning the preferred replicas, and hence the leaders.

Topics tool: The Kafka topics tool is in charge of all administration operations relating to topics, including:

  • Listing and describing topics.
  • Creating topics.
  • Modifying topics.
  • Adding partitions to a topic.
  • Deleting topics.

Tool to reassign partitions: The replicas assigned to a partition can be changed with this tool. This refers to adding or removing followers from a partition.

StateChangeLogMerger tool: The StateChangeLogMerger tool collects data from brokers in a cluster, formats it into a central log, and aids in the troubleshooting of state change issues. Sometimes there are issues with the election of a leader for a particular partition. This tool can be used to figure out what's causing the issue.

Change topic configuration tool: used to create new configuration choices, modify current configuration options, and delete configuration options.

 

Ques. 10):  Explain the four core API architecture that Kafka uses.

Answer:

Following are the four core APIs that Kafka uses:

Producer API:

The Producer API in Kafka allows an application to publish a stream of records to one or more Kafka topics.

Consumer API:

The Kafka Consumer API allows an application to subscribe to one or more Kafka topics. It also allows the programme to handle streams of records generated in connection with such topics.

Streams API: The Kafka Streams API allows an application to process data in Kafka using a stream processing architecture. This API allows an application to take input streams from one or more topics, process them with streams operations, and then generate output streams to send to one or more topics. In this way, the Streams API allows you to turn input streams into output streams.

Connect API:

The Kafka Connector API connects Kafka topics to applications. This opens up possibilities for constructing and managing the operations of producers and consumers, as well as establishing reusable links between these solutions. A connector, for example, may capture all database updates and ensure that they are made available in a Kafka topic.
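
A minimal sketch of the Producer API (the broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish a single record to a topic
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"));
        }
    }
}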

  

Ques. 11): Is it possible to utilise Kafka without Zookeeper?

Answer:

As of version 2.8, Kafka can be used without ZooKeeper. When Kafka 2.8.0 was released in April 2021, everyone had the opportunity to try it without ZooKeeper; however, this mode was not yet ready for production and was missing a few crucial features.

In earlier versions, it was not possible to bypass ZooKeeper and connect directly to the Kafka broker, because the brokers cannot serve client requests when ZooKeeper is down.

 

Ques. 12): Explain Kafka's concept of leader and follower.

Answer:

Each partition in Kafka has one server acting as a Leader and one or more servers acting as Followers. The Leader is in control of the partition's read and write requests, while the Followers are in charge of passively replicating the leader. If the Leader is unable to lead, one of the Followers will take over. As a result, the server's load is balanced.

 

Ques. 13): In Kafka, what is the function of partitions?

Answer:

From the standpoint of the Kafka broker, partitions allow a single topic to be spread across many servers. This lets you store more data in a single topic than fits on a single server. If you have three brokers and need to store 10 TB of data in a topic, one option is to create a topic with only one partition and store the entire 10 TB on one broker. Another option is to create a topic with three partitions and spread the 10 TB of data across all the brokers. From the consumer's perspective, a partition is also a unit of parallelism.

 

Ques. 14): In Kafka, what do you mean by geo-replication?

Answer:

Geo-replication is a feature in Kafka that allows you to copy messages from one cluster to a number of other data centres or cloud locations. You can use geo-replication to replicate all of the files and store them all over the world if necessary. Using Kafka's MirrorMaker Tool, we can achieve geo-replication. We can ensure data backup without fail by employing the geo-replication strategy.

 

Ques. 15): Is Apache Kafka a platform for distributed streaming? What are you going to do with it?

Answer:

Yes. Apache Kafka is a platform for distributed streaming data. Three critical capabilities are included in a streaming platform:

  • We can easily publish records using a distributed streaming infrastructure.
  • It has a large storage capacity and allows us to store a large number of records without difficulty.
  • It helps us process records as they arrive.

The Kafka technology allows us to do the following:

  • We can create real-time data pipelines with Apache Kafka to send data between two systems.
  • We can also create a real-time streaming platform that reacts to the data.

 

Ques. 16): What is Apache Kafka Cluster used for?

Answer:

Apache Kafka Cluster is a messaging system that is used to overcome the challenges of gathering and processing enormous amounts of data. The following are the most important advantages of Apache Kafka Cluster:

We can track web activities using Apache Kafka Cluster by storing/sending events for real-time processes.

We may use this to both alert and report on operational metrics.

We can also use Apache Kafka Cluster to transform data into a common format.

It enables the processing of streaming data to the subjects in real time.

Thanks to these outstanding characteristics, it is taking over from some of the most popular messaging systems, such as ActiveMQ, RabbitMQ, and various AWS offerings.

 

Ques. 17): What is the purpose of the Streams API?

Answer:

Streams API is an API that allows an application to function as a stream processor, ingesting an input stream from one or more topics and providing an output stream to one or more output topics, as well as effectively changing the input streams to output streams.

 

Ques. 18): In Kafka, what do you mean by graceful shutdown?

Answer:

Any broker shutdown or failure will be detected automatically by the Kafka cluster. In this case, new leaders will be elected for the partitions previously led by that broker. This can occur as a result of a server failure, or when the server is deliberately shut down for maintenance or configuration changes. For deliberate shutdowns, Kafka provides a graceful approach for stopping a server rather than killing it.

When a server is turned off, the following happens:

Kafka guarantees that all of its logs are synced onto a disc to avoid having to perform any log recovery when it is restarted. Purposeful restarts can be sped up since log recovery requires time.

Prior to shutting down, all partitions for which the server is the leader will be moved to the replicas. The leadership transfer will be faster as a result, and the period each partition is inaccessible will be decreased to a few milliseconds.

  

Ques. 19): In Kafka, what do the terms BufferExhaustedException and OutOfMemoryException mean?

Answer:

A BufferExhaustedException is thrown when the producer cannot allocate memory for a record because the buffer is full. If the producer is in non-blocking mode and the rate of production over an extended period exceeds the rate at which data is drained from the buffer, the allocated buffer will be exhausted and the exception will be thrown.

An OutOfMemoryException may occur if the consumers send large messages or if the quantity of messages sent increases faster than the rate of downstream processing. As a result, the message queue becomes overburdened, using RAM.

 

Ques. 20): How will you change the retention time in Kafka at runtime?

Answer:

A topic's retention time can be configured in Kafka. The default retention time for a topic is seven days. The retention time can be set when creating a new topic; when a topic is created, the broker property log.retention.hours is used as the default. When the configuration of an already running topic needs to be modified, the command-line tools must be used.

The right command depends on the Kafka version in use.

Up to version 0.8.2, use kafka-topics.sh --alter.

From version 0.9.0 onwards, use kafka-configs.sh --alter.

 


 

December 30, 2021

Top 20 Apache Tapestry Interview Questions and Answers

 

                Do you want to succeed with Apache Tapestry and advance your career? On this page, we provide a complete set of Apache Tapestry job interview questions and answers. Apache Tapestry is a Java-based open-source web framework built around components, designed for creating highly scalable web applications. Many top firms offer Apache Tapestry jobs in a variety of positions, and there are even more opportunities for experienced Java programmers. Job hunting can be difficult and exhausting, especially if you don't know how to apply, where to look, or how to prepare for job interviews. To minimise any confusion, we've framed these Apache Tapestry interview questions and answers to help you prepare for your interview.


Ques. 1): How do I run multiple Tapestry applications in the same web application?

Answer:

Multiple Tapestry 5 apps are not supported; there is only one place to identify the application root package, therefore configuring multiple filters into multiple directories isn't an option.

Tapestry 5 did not include support for numerous Tapestry applications in the same web application (it needlessly complicated Tapestry 4). Given how disjointed Tapestry 5 pages are, there doesn't appear to be a benefit to doing so... and, if it were possible, there would be a considerable drawback in terms of memory use.

You can run Tapestry 4 and Tapestry 5 apps side by side (the package names are different for this reason), but they have no knowledge of each other and are unable to interact directly. This is just like the way you could have a single WAR with multiple servlets; the different applications can only communicate via URLs, or shared state in the HttpSession.


Ques. 2): Tapestry focuses on the wrong field in my form, how do I fix that?

Answer:

Tapestry usually determines which field in your form should receive initial focus by assigning a FieldFocusPriority to each field as it renders, which translates to the following logic:

the first field in error;

otherwise, the first required field;

otherwise, the first field.

For a variety of reasons beyond Tapestry's control, this choice may not always be exactly what you want, in which case you need an override. The information is tracked in the JavaScriptSupport environment, so it's simply a matter of injecting the component to get its client id, then notifying JavaScriptSupport of your override.

Here's an example

 <t:textfield t:id="email" t:mixins="OverrideFieldFocus" .../>

The OverrideFieldFocus mixin forces the email field to be the focus field, regardless.


Ques. 3): Is Tapestry A Jsp Tag Library?

Answer :

Tapestry is not a JSP tag library; it is based on the servlet API but does not make use of JSPs. It has its own rendering engine and HTML template format. Tapestry now offers a simple JSP tag library in release 3.0, allowing JSP pages to link to Tapestry pages.


Ques. 4): I Have A Form With A Submit Button. On The Form And The Submit Button Are Two Separate Listeners. Which Is Invoked First?

Answer :

When the form encounters your button at some point during the rewind, the button's listener is invoked. The form's submit listener is only invoked after the form has completed its rewind, so the button's listener always fires before the form's listener. Note that this can mean the button's listener is called before the form has 'submitted' all of its values; it all depends on where your input fields are in relation to your button.

 

Ques. 5): Is There A Wysiwyg Editor For Tapestry, Or An Ide Plugin?

Answer :

Tapestry currently lacks a WYSIWYG editor; nonetheless, the nature of Tapestry allows existing editors to function reasonably well (Tapestry additions to the HTML markup are virtually invisible to a WYSIWYG editor). Spindle is a Tapestry plugin for the Eclipse IDE, which is free and open-source. Tapestry apps, pages, and components may now be created using wizards and editors.

 

Ques. 6): How Is The Performance Of Tapestry?

Answer :

Other testing (recorded in the Tapestry discussion boards) coincides with my own testing, which was published in the September 2001 issue of the Java Report: Although plain JSPs have a minor advantage in demo applications, performance curves for equal Tapestry and JSP applications with a database or application server backend are identical. Consider the performance of your Java developers rather than the performance of Tapestry.

 

Ques. 7): What’s The Lifecycle Of A Form Submit?

Answer :

Events will trigger in the following order:

initialize()

pageBeginRender()

formListenerMethod()

pageBeginRender()

The form "rewind" cycle is simply a render cycle in which the output is buffered and discarded instead of being written to the servlet output stream. The second pageBeginRender() is invoked when the page is actually rendered. To distinguish between these two render cycles, use requestCycle.isRewinding().

 

Ques. 8): Does Tapestry Work With Other Other Application Servers Besides Jboss?

Answer :

Yes, of course! For the turn-key demonstrations, JBoss is free and convenient. In less than a minute, you can download Tapestry and JBoss and have a real J2EE application running! JBoss configuration scripts are specific to a specific release of JBoss, which must be 3.0.6. Tapestry apps, on the other hand, are completely container agnostic... Tapestry is unconcerned with the servlet container it's in, and it doesn't even require an EJB container.

 

Ques. 9): Can I Use The Same Component Multiple Times In One Template?

Answer:

No – but you can copy the definition of a component pretty easily.

<component id="valueInsert" type="Insert">

 <binding name="value" expression="getValueAt( rowIndex, columnIndex )" />

 </component>

<component id="valueInsert1" copy-of="valueInsert"/>

 <component id="valueInsert2" copy-of="valueInsert"/>

 <component id="valueInsert3" copy-of="valueInsert"/>

 <component id="valueInsert4" copy-of="valueInsert"/>

 

Ques. 10): Why is @script required in Apache Tapestry?

Answer:

The script framework is a useful tool for grouping scripts into components. It delivers the benefits of components to scripts. It may now be utilised as a component without having to bother about renaming field names or rewiring the fields and scripts. All you have to do now is declare the component and you're ready to go. It is true that another layer of abstraction must be acquired, but once learned, it is extremely powerful. And, to tell you the truth, there isn't much to it.

The script framework is required since form element/field names are produced automatically by the framework. As a result, you write your script in XML, assigning these names to variables, and relying on the framework to deliver the correct names at runtime. Further, you can request that the framework include extra objects that will aid in the creation of your script.

 

Ques. 11): When a form is submitted, why does Tapestry send a redirect?

Answer:

This is a variant of the Post/Redirect/Get strategy. It ensures that if a user resubmits the resultant page after an operation that alters server-side state, such as a form submission, the operation is not repeated; instead, only the results of the operation, reflecting the updated server-side state, are re-rendered.

This has the unwelcome consequence of requiring any data required to produce the answer to persist between the event request (the form submission) and the render request; this frequently necessitates the use of @Persist annotations on fields.

 

Ques. 12): When I use an HTML entity like &nbsp; in my template, why do I get a SAXParseException?

Answer:

Tapestry reads your templates using a regular SAX parser. This means that your templates must be well-formed, with balanced open and close tags, quoted attribute values, and defined entities. The simplest method to do this is to include a DOCTYPE at the top of your template:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Part of the DOCTYPE is the declaration of entities such as &nbsp;.

You can also use the numeric version: &#160; This is the exact same character and will render in the browser identically. When no doctype is present, Tapestry adds an XHTML doctype; this ensures that common HTML entities behave properly.

 

Ques. 13): When I submit a form, I get an error message that says "Tapestry is undefined." Why?

Answer:

This client-side mistake is obvious, yet it can be difficult to resolve. It implies your browser was unable to correctly load the tapestry.js file. Why, exactly, is the question? It could be for a variety of causes, some of which are listed below:

Check to see if 'tapestry.js' is present in the head section of your HTML page.

Tapestry generates a single URL to fetch all the JS files if the tapestry.combine-scripts configuration symbol is set to true. This can sometimes result in long URLs that browsers are unable to fetch; try setting the symbol to false and see whether that helps.

If you use jQuery alongside Tapestry's Prototype library, the '$' selector used by both will clash. In that case, put jQuery at the top of the stack and enable jQuery.noConflict mode.

Also, if you've added a custom or third-party JS library to the stack and it's causing the JavaScript parsing to fail, look at the JavaScript syntax in that library.

If you used a programme to compress your JavaScript libraries, it's possible that you'll get JavaScript syntax issues, so make sure it works with all of the JavaScript files unpacked.

 

Ques. 14): Do I have to specify both id and t:id for Zone components?

Answer:

The Zone component's examples (in the Component Reference) always give both id and t:id, which is definitely a good thing.

In general, the client-side id (the id attribute) will be the same as the Tapestry component id if you don't define it (t:id).

There are, however, numerous exceptions to this rule. It's possible that the Zone is rendering inside a Loop (in which case, each rendering will have a unique client side id). A random unique id is added into the id if the Zone is rendering as part of a partial page render. Tapestry component ids in nested components can potentially clash in other situations.

 

Ques. 15): When rendering an empty Zone, why do I get the exception "The produced content did not include any components that allow for the positioning of the concealed form field's element."?

Answer:

Tapestry must write a hidden input element with information needed when the form is submitted as part of its form processing. Because the content of a Zone can be altered or erased, a hidden field separate from the rest of the enclosing form is established solely for the Zone.

At the same time, Tapestry wants to position the <input> field in a valid location, and HTML defines some constraints for that; an input field must appear inside a <p> or <div> element. If your zone is initially empty, there's no place to put the hidden element, and Tapestry will complain.

The solution is simple: just add a <div> element to the body of the zone. This ensures that there's a place for the hidden input field.  An empty <div> element (even one containing a hidden form field) will not affect page layout.

 

Ques. 16): Why is it necessary for me to provide an interface for my services? Why can't I simply use the class?

Answer:

To begin with, you can do just this, but you will lose some of the functionality provided by Tapestry's IoC container.

Tapestry will be able to provide functionality for your service around the core service implementation as a result of the split. This is accomplished by using proxies, which are Java classes that implement the service interface. The proxy's methods will eventually call the methods of your service implementation.

One of the most important functions of proxies is to encapsulate the life cycle of a service: most services are singletons that are built just once. Just in time refers to the moment you call a method. What's going on is that the life cycle proxy (the object that gets injected into pages, components or other service implementations) checks on each method invocation to see if the actual service exists yet. If not, it instantiates and configures it (using proper locking to ensure thread safety), then delegates the method invocation to the service.

If you bind a service class (rather than a service interface and class), the service is fully instantiated the first time it is injected, rather than when the first method is invoked. Furthermore, you can't apply decorations or method advice to such a service.

The final reason for the service interface / implementation split is to nudge you towards always coding to an interface, which has manifest benefits for code structure, robustness, and testability.

 

Ques. 17): How do I make my service startup with the rest of the application, rather than lazily?

Answer:

Tapestry services are designed to be lazy; they are only fully realized when needed: when the first method on the service interface is invoked.

Sometimes a service does extra work that is desirable at application startup: examples include registering message handlers with a JMS implementation, or setting up indexing. However, the service's constructor (and any @PostInjection methods) will not be invoked until the service is realized.

The solution is the @EagerLoad annotation; service implementation classes marked with this annotation are loaded when the Registry first starts up, rather than lazily.

 

Ques. 18): How can I dynamically add new components to an existing page?

Answer:

You don't, to put it succinctly. The long answer is that you don't have to in order to achieve the desired behaviour.

High scalability is one of Tapestry's core values; it can be expressed in a variety of ways, reflecting scalability concerns both within a single server and across a cluster of servers.

Although you code Tapestry pages and components as if they were ordinary POJOs (Plain Old Java Objects; Tapestry doesn't require you to extend any base classes or implement any special interfaces), they behave more like a traditional servlet when deployed by Tapestry: a single instance of each page serves requests from multiple threads.

 

Ques. 19): What does Tapestry's "static structure, dynamic behavior" principle mean for varying page content?

Answer:

Because a single page instance must be able to handle any incoming request, Tapestry enforces the concept of static structure, dynamic behavior. Beyond simple conditionals and loops, Tapestry offers a variety of options for varying what content is presented. When rendering a page, you can "drag in" components from other pages (other FAQs expand on this concept). The idea is that, while the structure of a Tapestry page is quite strict, the order in which the page's components render does not have to be top to bottom.

 

Ques. 20): Why do my images and stylesheets end up with weird URLs like /assets/meta/zeea17aee26bc0cae/layout/layout.css?

Answer:

Tapestry does not use the servlet container to serve static assets (images, stylesheets, Flash movies, etc.). Instead, Tapestry handles these requests itself and streams the assets to the browser.

The content of the assets will be compressed using GZIP (if the client supports compression and the content is compressible). Tapestry also adds a far-future expires header to the content, which means the browser will not ask for the file again, resulting in a significant reduction in network traffic.

The strange hex string is a fingerprint; it's a hash code calculated from the asset's real content. If the asset changes, a new fingerprint will be created, as well as a new path and (immutable) resource. This approach, combined with a far-future expires header also provided by Tapestry, ensures that clients aggressively cache assets as they navigate your site, or even between visits.



November 17, 2021

Top 20 Apache Hive Interview Questions & Answers


Ques: 1). What is Apache Hive, and how does it work?

Answer:

Apache Hive is a data warehouse project built on top of Hadoop. The platform focuses on data analysis and includes data query capabilities. Hive provides a SQL-like interface (HiveQL) for querying data stored in files and database systems, and it is a popular data analysis and querying technology used by Fortune 500 companies around the world. When it is cumbersome or inefficient to express the logic in HiveQL, Hive allows standard MapReduce programs to plug in custom mappers and reducers, as well as user-defined functions (UDFs).

 

BlockChain Interview Question and Answers


Ques: 2). What is the purpose of Hive?

Answer:

Hive is a Hadoop tool that allows you to organise and query data in a database-like format, as well as write SQL-like queries. It can be used to access and analyse Hadoop data using SQL syntax.

 Apache Ambari interview Questions & Answers

Ques: 3). What are the differences between local and remote meta stores?

Answer:

Local meta store: In the local metastore configuration, the metastore service runs in the same Java Virtual Machine (JVM) as the Hive service and connects to a database running in a separate process, either on the same machine or on a remote machine.

Remote meta store: In the remote metastore configuration, the metastore service and the Hive service run in separate JVMs. Other processes connect to the metastore server using the Thrift network API. You can run several metastore servers in this mode for high availability.

Apache Tapestry Interview Questions and Answers

Ques: 4). Explain the core difference between the external and managed tables?

Answer:

The following are the fundamental distinctions between managed and external tables:

When a managed table is dropped, both the metadata and the table data are lost. An external table is quite different: dropping it only deletes the metadata associated with the table and leaves the table data in HDFS untouched.

By default, when you create a table, Hive manages the data, which means it moves the data into its warehouse directory. Alternatively, you can create an external table, which tells Hive to refer to data stored somewhere other than the warehouse directory.

The semantics of LOAD and DROP show the difference between the two table types. Let's start with a managed table. Data loaded into a managed table is stored in Hive's warehouse directory.
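
As an illustration (the table names, columns, and HDFS location here are hypothetical), the two table types might be created like this:

-- Managed table: Hive owns the data; DROP TABLE removes metadata and data.
CREATE TABLE managed_logs (id INT, msg STRING);

-- External table: Hive only records metadata; the files under LOCATION
-- remain in HDFS even after DROP TABLE.
CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
LOCATION '/data/logs/external_logs';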

 Apache NiFi Interview Questions & Answers

Ques: 5). What is the difference between schema on read and schema on write?

Answer:

A table's schema is enforced at data load time in a conventional database. The data being loaded is rejected if it does not conform to the schema. Because the data is validated against the schema when it is written into the database, this architecture is frequently referred to as schema on write.

Hive, on the other hand, verifies data when it is loaded, rather than when it is queried. This is referred to as schema on read.

Between the two approaches, there are trade-offs. Because the data does not have to be read, parsed, and serialized to disc in the database's internal format, schema on read allows for a very quick first load. A file copy or move is all that is required for the load procedure. It's also more adaptable: think of having two schemas for the same underlying data, depending on the analysis. (External tables can be used in Hive for this; see Managed Tables and External Tables.)

Because the database can index columns and compress the data, schema on write makes query time performance faster. However, it takes longer to load data into the database as a result of this trade-off. Furthermore, in many cases, the schema is unknown at load time, thus no indexes can be applied because the queries have not yet been formed. Hive really shines in these situations.

 Apache Spark Interview Questions & Answers

Ques: 6). Write a query to add a new column. Can you add a column with a default value in Hive?

Answer:

ALTER TABLE test1 ADD COLUMNS (access_count1 int);

You cannot add a column with a default value in Hive. The addition of the column has no effect on the files that support your table. Hive interprets NULL as the value for every cell in that column in order to deal with the "missing" data.

In Hive, you must effectively recreate the entire table, this time with the column filled in. It may be easier to simply rerun your original query with the additional column included. Alternatively, you can add the column to the existing table and then overwrite it, selecting all of its existing columns plus the desired value for the new column, as sketched below.
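
A minimal sketch of that overwrite approach (the other column names shown are illustrative):

ALTER TABLE test1 ADD COLUMNS (access_count1 INT);

-- Backfill the new column with a "default" of 0 by rewriting the table;
-- Hive stages the query output before swapping it into place.
INSERT OVERWRITE TABLE test1
SELECT id, name, 0 AS access_count1
FROM test1;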


Ques: 7). What is the purpose of Hive's DISTRIBUTED BY clause?

Answer:

DISTRIBUTE BY determines how map output is divided among reducers. By default, MapReduce computes a hash on the keys output by mappers and uses the hash values to try to distribute the key-value pairs evenly among the available reducers. Suppose, though, that we want all of the rows for each value of a particular column to be processed together. We can use DISTRIBUTE BY to ensure that the records for each such value go to the same reducer. DISTRIBUTE BY controls how reducers receive rows for processing, in the same way that GROUP BY controls how rows are grouped.

If the DISTRIBUTE BY and SORT BY clauses are in the same query, Hive expects the DISTRIBUTE BY clause to be before the SORT BY clause. When you have a memory-intensive job, DISTRIBUTE BY is a helpful workaround because it requires Hadoop to employ Reducers instead of having a Map-only job. Essentially, Mappers gather data depending on the DISTRIBUTE BY columns supplied, reducing the framework's overall workload, and then transmit these aggregates to Reducers.
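
For example (the table and column names are illustrative), the following sends all rows for a given user_id to the same reducer and sorts them there:

SELECT user_id, event_time, action
FROM clicks
DISTRIBUTE BY user_id         -- all rows for one user_id go to one reducer
SORT BY user_id, event_time;  -- rows are then sorted within each reducer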

 

Ques: 8). What happens when you run a query in Hive?

Answer:

The query planner examines the query and turns it into a DAG (Directed Acyclic Graph) of Hadoop MapReduce jobs.

The jobs are submitted to the Hadoop cluster in the order that the DAG suggests.

Only mappers are used for simple queries. The InputFormat is responsible for splitting the input and reading data from HDFS. After that, the data is passed to a layer called the SerDe (Serializer/Deserializer). In this case, the deserializer part of the SerDe converts the data from a byte stream into a structured format.

MapReduce jobs for aggregate queries will also include reducers. In that case, the serializer part of the SerDe converts structured data into a byte stream, which is handed over to the OutputFormat, which writes it to HDFS.

 

Ques: 9). What is the importance of STREAMTABLE?

Answer:

Joins are useful when you need information from several tables, but when one table holds 1.5 billion or more rows and you want to join it to a master table, the order of the joined tables is crucial.

Consider the following scenario: 

select foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a; 

Hive streams the right-most table (bar) and buffers the other tables (foo) in memory before executing map-side/reduce-side joins. As a result, if you have to buffer 1.5 billion or more records, your join query is likely to fail, because that many records will almost certainly exhaust the Java heap space.

So, to overcome this limitation and free the user from having to order the joined tables by their record size, Hive provides the hint /*+ STREAMTABLE(foo) */, which tells the Hive analyzer to stream table foo.

select /*+ STREAMTABLE(foo) */ foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a;

This way, the user does not have to remember the order of the joined tables.

 

Ques: 10). When is it appropriate to use SORT BY instead of ORDER BY?

Answer:

When working with huge volumes of data in Apache Hive, we use SORT BY instead of ORDER BY. SORT BY uses multiple reducers, each of which produces its own sorted output, which cuts down the time it takes to complete the job. ORDER BY, on the other hand, uses only a single reducer to produce a totally ordered result, so the process takes much longer to complete.
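
A quick illustration (the table and column names are hypothetical):

-- Total ordering: a single reducer, slow on large data sets.
SELECT * FROM sales ORDER BY amount DESC;

-- Per-reducer ordering: many reducers, faster, but the output is only
-- sorted within each reducer's share of the data.
SELECT * FROM sales SORT BY amount DESC;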

 

Ques: 11). What is the purpose of Hive's Partitioning function?

Answer:

Partitioning allows users to arrange the data in a Hive table by the values of chosen partition columns. As a result, the system can scan only the relevant partitions rather than the complete data set.

Consider the following scenario: Assume we have transaction log data from a business website for years such as 2018, 2019, 2020, and so on. So, in this case, you can utilise the partition key to find data for a specified year, say 2019, which will reduce data scanning by removing 2018 and 2020.
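
A minimal sketch of that scenario (the table and column names are illustrative):

CREATE TABLE transaction_logs (txn_id STRING, amount DOUBLE)
PARTITIONED BY (year STRING);

-- Only the year=2019 partition is scanned; 2018 and 2020 are pruned.
SELECT txn_id, amount
FROM transaction_logs
WHERE year = '2019';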

 

Ques: 12). What is dynamic partitioning and how does it work?

Answer:

In dynamic partitioning, the values of the partition columns are determined at runtime, i.e. they are discovered from the data while it is being loaded into the Hive table. The following is a common use of dynamic partitioning:

Moving data from a non-partitioned table into a partitioned table, which reduces latency and improves sampling; a sketch of such an insert follows below.
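
A sketch of a dynamic-partition insert (the table and column names are illustrative; the two SET properties are the standard Hive settings that enable dynamic partitioning):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The year value for each row is taken from the SELECT at runtime.
INSERT OVERWRITE TABLE transaction_logs PARTITION (year)
SELECT txn_id, amount, year
FROM staging_logs;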

 

Ques: 13). In Hive, what's the difference between dynamic and static partitioning?

Answer:

Hive partitioning is highly beneficial for pruning data during queries in order to reduce query times.

Partitions are produced when data is inserted into a table, and which kind of partitioning you need depends on how the data is loaded. When loading files (especially large files) into Hive tables, static partitions are usually preferred; compared to dynamic partitioning, this saves time when loading the data. You "statically" create a partition in the table and then move the file into that partition.

Because the files are large, they are typically created on HDFS. Without reading the entire large file, you can retrieve the partition column value from the filename, date, and so on. In the case of dynamic partitioning, the entire large file is read, i.e. every row of data is read, and the data is partitioned into the target tables using an MR job based on specified fields in the file.

Dynamic partitions are typically handy when performing an ETL step in your data pipeline. For example, suppose you load a large file into Table X, then run an insert query into Table Y that splits the data based on Table X fields such as day and country. You might then run a further ETL step to partition the data in Table Y's country partition into a Table Z, where the data is partitioned by city for a specific country, and so on.

Thus, depending on your target tables, your data requirements, and the form in which data is produced at the source, you may choose static or dynamic partitioning.
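
For comparison with the dynamic example above, a static-partition load might look like this (names are illustrative):

-- The partition value is fixed up front; the file is simply moved into
-- that partition's directory without being read row by row.
LOAD DATA INPATH '/staging/logs_2019.csv'
INTO TABLE transaction_logs PARTITION (year = '2019');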

 

Ques: 14). What is ObjectInspector in Hive?

Answer:

The ObjectInspector is a feature that allows us to analyze individual columns and the internal structure of a row object in Hive. It also provides a seamless way to access complex objects that can be stored in varied formats in memory, such as:

  • A standard Java object
  • An instance of the Java class
  • A lazily initialized object

The ObjectInspector lets the users know the structure of an object and also helps in accessing the internal fields of an object.

 

Ques: 15). How does Impala outperform Hive in terms of query response time?

Answer:

Impala should be thought of as "SQL on HDFS," whereas Hive is more "SQL on Hadoop."

Impala, in other words, does not use MapReduce at all. It simply runs daemons on all of the nodes that hold HDFS data, allowing those daemons to return results quickly without having to run a full MapReduce job.

The rationale for this is that running a Map/Reduce operation has some overhead, so short-circuiting Map/Reduce completely can result in a significant reduction in runtime.

That said, Impala is not a replacement for Hive; each is useful in different situations. Unlike Hive, Impala does not support fault tolerance, so if there is a problem during your query, the query fails and has to be rerun. I would recommend Hive for ETL processes where a single job failure would be costly, but Impala can be great for small ad-hoc queries, such as for data scientists or business analysts who just want to look at and study some data without having to build substantial jobs.

 

Ques: 16). Explain the different components used in the Hive Query processor?

Answer:

Below mentioned is the list of Hive Query processors:

  • Metadata Layer (ql/metadata)
  • Parse and Semantic Analysis (ql/parse)
  • Map/Reduce Execution Engine (ql/exec)
  • Sessions (ql/session)
  • Type Interfaces (ql/typeinfo)
  • Tools (ql/tools)
  • Hive Function Framework (ql/udf)
  • Plan Components (ql/plan)
  • Optimizer (ql/optimizer)

 

Ques: 17). What is the difference between Hadoop Buffering and Hadoop Streaming?

Answer:

Hadoop Streaming means using custom Python or shell scripts to implement your map-reduce logic (in Hive, this is done with the TRANSFORM keyword, for example).
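
To illustrate the streaming side, a TRANSFORM query might look like this (the script name and table are hypothetical):

-- Ship the script to the cluster, then pipe rows through it; the script
-- reads tab-separated lines on stdin and writes tab-separated lines on stdout.
ADD FILE /scripts/clean_urls.py;

SELECT TRANSFORM (user_id, raw_url)
       USING 'python clean_urls.py'
       AS (user_id, clean_url)
FROM weblogs;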

In this context, Hadoop buffering refers to the phase in a map-reduce job of a Hive query with a join when records are read into the reducers after being sorted and grouped by the mappers. This is why you should order the join clauses in a Hive query so that the largest table comes last: it helps Hive implement the join more efficiently, since the last table is streamed rather than buffered.

 

Ques: 18). How does a map-side join optimise the work?

Answer:

Suppose we have two tables, one of which is small. Before the original join MapReduce task, a local MapReduce job is generated that reads the small table's data from HDFS and puts it into an in-memory hash table, which it then serialises into a hash table file.

The data in the hash table file is then moved to the Hadoop distributed cache, which populates these files to each mapper's local disc in the following stage, while the original join Map Reduce process is running. As a result, all mappers can reload this permanent hash table file into memory and perform the join operations as previously. 

After this optimisation, the small table only has to be read once. In addition, if several mappers are running on the same machine, the distributed cache only needs to send a single copy of the hash table file to that machine.

Advantages of using Map-side join:

Using a map-side join reduces the cost of sorting and merging data in the shuffle and reduce stages. The map-side join also improves performance by reducing the time it takes to complete the job.

Disadvantages of Map-side join:

It is only suitable for use when one of the tables on which the map-side join operation is performed is small enough to fit into memory. As a result, performing a map-side join on tables with a lot of data in each of them isn't a good idea.
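
As a sketch (the table and column names are illustrative), a map-side join can be requested with a hint, or Hive can be left to convert small joins automatically:

-- Explicit hint: build the in-memory hash table from the small table d.
SELECT /*+ MAPJOIN(d) */ f.txn_id, d.country_name
FROM fact_sales f JOIN dim_country d ON f.country_id = d.country_id;

-- Or let Hive convert joins automatically when one side is small enough.
SET hive.auto.convert.join = true;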

 

Ques: 19). What types of user-defined functions exist in Hive?

Answer:

A UDF operates on a single row and produces a single row as its output. Most functions, such as mathematical functions and string functions, are of this type.

A UDF must satisfy the following two properties:

  • A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
  • A UDF must implement at least one evaluate() method.

 

A UDAF works on multiple input rows and creates a single output row. Aggregate functions include such functions as COUNT and MAX.
A UDAF must satisfy the following two properties:
  • A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF;
  • An evaluator must implement five methods:
    • init()
    • iterate()
    • terminatePartial()
    • merge()
    • terminate()

  • A UDTF operates on a single row and produces multiple rows — a table — as output.
  • A UDTF must be a subclass of org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
  • A custom UDTF can be created by extending the GenericUDTF abstract class and then implementing the initialize, process, and possibly close methods.
  • The initialize method is called by Hive to notify the UDTF the argument types to expect.
  • The UDTF must then return an object inspector corresponding to the row objects that the UDTF will generate.
  • Once initialize() has been called, Hive will give rows to the UDTF using the process() method.
  • While in process(), the UDTF can produce and forward rows to other operators by calling forward().
  • Lastly, Hive will call the close() method when all the rows have passed to the UDTF.
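
Once a UDF has been compiled into a jar, it can be registered and used from HiveQL like this (the jar path, class name, function name, and table are hypothetical):

ADD JAR /udfs/my-hive-udfs.jar;

CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';

SELECT my_lower(customer_name) FROM customers;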

 

Ques: 20). Is the HIVE LIMIT clause truly random?

Answer:

Although the manual claims that it returns rows at random, this is not really the case. Without any WHERE or ORDER BY clause, it simply returns rows as they occur in the underlying files. This doesn't mean the rows are truly random (or randomly picked); it only means that the order in which they are returned can't be predicted.

It returns the last 5 rows of whatever you're picking from as soon as you slap an order by x DESC limit 5 on there. You'd have to use something like order by rand() LIMIT 1 to get rows returned at random.
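
For example (the table and columns are hypothetical):

-- Deterministic "last 5" once an ORDER BY is applied.
SELECT * FROM events ORDER BY event_id DESC LIMIT 5;

-- A genuinely random sample of 5 rows, at the cost of a full sort.
SELECT * FROM events ORDER BY rand() LIMIT 5;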

However, if your indexes aren't set up correctly, it can slow things down. I usually do a min/max to get the IDs on the table, then a random number between them, then choose those records (in your instance, just one), which is usually faster than letting the database do the work, especially on a huge dataset.