
December 30, 2021

Top 20 Apache Tapestry Interview Questions and Answers

 

                Do you want to succeed with Apache Tapestry and advance your career? On this page, we offer you a complete set of Apache Tapestry job interview questions and answers. Apache Tapestry is a Java-based open-source web framework built around components, designed for building highly scalable web applications. Many top firms offer Apache Tapestry jobs in a variety of positions, and there are plenty of opportunities for experienced Java programmers. Job hunting can be difficult and exhausting, especially if you don't know how to apply, where to look, or how to prepare for job interviews. To minimise any confusion, we've framed these Apache Tapestry job interview questions and answers to help you prepare for your interview.


Ques. 1): How do I run multiple Tapestry applications in the same web application?

Answer:

Multiple Tapestry 5 applications in one web application are not supported; there is only one place to identify the application root package, so configuring multiple filters mapped to different folders isn't an option.

Support for multiple Tapestry applications in the same web application was dropped in Tapestry 5 (it needlessly complicated Tapestry 4). Given how loosely connected Tapestry 5 pages are, there doesn't appear to be a benefit to doing so... and, if it were possible, there would be a considerable drawback in terms of memory use.

You can run Tapestry 4 and Tapestry 5 apps side by side (the package names are different for this reason), but they have no knowledge of each other and are unable to interact directly. This is just like the way you could have a single WAR with multiple servlets; the different applications can only communicate via URLs, or shared state in the HttpSession.


Ques. 2): Tapestry focuses on the wrong field in my form, how do I fix that?

Answer:

Tapestry determines which field in your form should receive initial focus by assigning a FieldFocusPriority to each field as it renders, which translates to the following logic:

the first field with an error, if there is one;

otherwise, the first required field;

otherwise, the first field.

For a variety of reasons beyond Tapestry's control, this choice may not always be exactly what you want, in which case you need an override. This information is tracked in the JavaScriptSupport environment. It's simply a matter of injecting the component to get its client id, then notifying JavaScriptSupport of your override.

Here's an example

 <t:textfield t:id="email" t:mixins="OverrideFieldFocus" .../>

The OverrideFieldFocus mixin forces the email field to be the focus field, regardless of Tapestry's default choice.
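For reference, here is a minimal sketch of what such a mixin might look like on the Java side (illustrative only, assuming the Tapestry 5 JavaScriptSupport API; the actual mixin shipped with your release may differ):

import org.apache.tapestry5.Field;
import org.apache.tapestry5.FieldFocusPriority;
import org.apache.tapestry5.annotations.Environmental;
import org.apache.tapestry5.annotations.InjectContainer;
import org.apache.tapestry5.services.javascript.JavaScriptSupport;

public class OverrideFieldFocus
{
    // The field this mixin is attached to
    @InjectContainer
    private Field field;

    @Environmental
    private JavaScriptSupport javaScriptSupport;

    void afterRender()
    {
        // OVERRIDE outranks the priorities Tapestry assigns on its own
        javaScriptSupport.autofocus(FieldFocusPriority.OVERRIDE, field.getClientId());
    }
}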


Ques. 3): Is Tapestry a JSP Tag Library?

Answer :

Tapestry is not a JSP tag library; it is built on the servlet API but does not make use of JSPs. It has its own rendering engine and HTML template format. As of release 3.0, Tapestry also offers a simple JSP tag library that allows JSP pages to link to Tapestry pages.


Ques. 4): I Have A Form With A Submit Button. On The Form And The Submit Button Are Two Separate Listeners. Which Is Invoked First?

Answer :

When the form encounters your button at some point during the rewind, the button's listener is invoked. The form's submitListener is invoked only after the form has completed its rewind, so the button's listener is always invoked first. Note that this can mean the listener for a button is called before the form has 'submitted' all of its values; it all depends on where your input fields are in relation to your button.

 

Ques. 5): Is There A WYSIWYG Editor For Tapestry, Or An IDE Plugin?

Answer :

Tapestry currently lacks a WYSIWYG editor; however, the nature of Tapestry allows existing editors to work reasonably well (Tapestry's additions to the HTML markup are virtually invisible to a WYSIWYG editor). Spindle is a free, open-source Tapestry plugin for the Eclipse IDE; it adds wizards and editors for creating Tapestry applications, pages, and components.

 

Ques. 6): How Is The Performance Of Tapestry?

Answer :

Other testing (recorded in the Tapestry discussion boards) agrees with my own testing, published in the September 2001 issue of Java Report: although plain JSPs have a minor advantage in demo applications, the performance curves for equivalent Tapestry and JSP applications with a database or application-server backend are identical. Worry about the productivity of your Java developers rather than the performance of Tapestry.

 

Ques. 7): What’s The Lifecycle Of A Form Submit?

Answer :

Events will trigger in the following order:

initialize()

pageBeginRender()

formListenerMethod()

pageBeginRender()

The form "rewind" cycle is simply a render cycle with the output buffered and scraped instead of being written to the servlet output stream. The second pageBeginRender() is called while the page is being rendered. To distinguish between these two render cycles, use requestCycle.isRewinding().

 

Ques. 8): Does Tapestry Work With Other Application Servers Besides JBoss?

Answer :

Yes, of course! JBoss is free and convenient for the turn-key demonstrations: in less than a minute, you can download Tapestry and JBoss and have a real J2EE application running! The JBoss configuration scripts are specific to a particular release of JBoss, which must be 3.0.6. Tapestry applications themselves, however, are completely container agnostic: Tapestry doesn't care which servlet container it runs in, and it doesn't even require an EJB container.

 

Ques. 9): Can I Use The Same Component Multiple Times In One Template?

Answer:

No – but you can copy the definition of a component pretty easily.

<component id="valueInsert" type="Insert">

    <binding name="value" expression="getValueAt( rowIndex, columnIndex )"/>

</component>

<component id="valueInsert1" copy-of="valueInsert"/>

<component id="valueInsert2" copy-of="valueInsert"/>

<component id="valueInsert3" copy-of="valueInsert"/>

<component id="valueInsert4" copy-of="valueInsert"/>

 

Ques. 10): Why is @script required in Apache Tapestry?

Answer:

The script framework is a useful tool for packaging scripts as components: it delivers the benefits of components to scripts. A script can now be used as a component without having to worry about renaming field names or rewiring the fields and scripts. All you have to do is declare the component and you're ready to go. Admittedly, there is another layer of abstraction to learn, but once learned it is extremely powerful. And, to tell the truth, there isn't much to it.

The script framework is also necessary because form element/field names are generated automatically by the framework. You write your script in XML, assigning these generated names to variables, and rely on the framework to supply the correct names at runtime. Furthermore, you can ask the framework to include extra objects that help in creating your script.

 

Ques. 11): When a form is submitted, why does Tapestry send a redirect?

Answer:

This is a variant of the Post/Redirect/Get strategy. It ensures that if a user resubmits the resultant page after an operation that alters server-side state, such as a form submission, the operation is not repeated; instead, only the results of the operation, reflecting the updated server-side state, are re-rendered.

This has the unwelcome consequence that any data needed to render the response must persist between the event request (the form submission) and the render request; this frequently necessitates @Persist annotations on fields.
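For example, a page might persist a value across the redirect like this (a minimal sketch; the class, field, and message are hypothetical):

import org.apache.tapestry5.annotations.Persist;

public class Checkout
{
    @Persist
    private String message; // survives the redirect that follows the submission

    void onSuccess()
    {
        message = "Your order has been placed.";
    }
}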

 

Ques. 12): Why do I get a SAXParseException when I use an HTML entity such as &nbsp; in my template?

Answer:

Tapestry reads your templates using a regular SAX parser. This means that your templates must be well-formed, with balanced open and close tags, quoted attribute values, and defined entities. The simplest method to do this is to include a DOCTYPE at the top of your template:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Part of the DOCTYPE is the declaration of entities such as &nbsp;.

You can also use the numeric version: &#160;. This is exactly the same character and renders identically in the browser. When no DOCTYPE is present, Tapestry adds an XHTML DOCTYPE; this ensures that common HTML entities behave properly.

 

Ques. 13): When I submit a form, I get an error message that says "Tapestry is undefined." Why?

Answer:

This client-side error is obvious, yet it can be difficult to track down. It means your browser was unable to load the tapestry.js file correctly. The question is why, and it could be any of several causes, some of which are listed below:

Check that 'tapestry.js' is present in the head section of your rendered HTML page.

If the tapestry.combine-scripts configuration symbol is set to true, Tapestry generates a single URL to retrieve all the JS files. This can sometimes produce URLs too long for the browser to fetch. Set the symbol to false and see whether that helps.

If you use jQuery in conjunction with Tapestry's built-in Prototype, the '$' selector used by both will clash. In this scenario, load jQuery at the top of the stack and enable jQuery.noConflict mode.

Also, if you've added a custom or third-party JS library to the stack and it causes the JavaScript parsing to fail, check the JavaScript syntax in that library.

If you used a tool to minify your JavaScript libraries, the minification may itself introduce JavaScript syntax errors, so make sure everything works with the JavaScript files unminified.

 

Ques. 14): Do I have to specify both id and t:id for Zone components?

Answer:

The Zone component's examples (in the Component Reference) always give both id and t:id, which is definitely a good thing.

In general, the client-side id (the id attribute) will be the same as the Tapestry component id if you don't define it (t:id).

There are, however, numerous exceptions to this rule. The Zone may be rendering inside a Loop (in which case each rendering will have a unique client-side id), or as part of a partial page render (in which case a random unique id is incorporated into the client id). Tapestry component ids in nested components can also clash in other situations.

 

Ques. 15): When rendering an empty Zone, why do I get the exception "The rendered content did not include any elements that allow for the positioning of the hidden form field's element."?

Answer:

Tapestry must write a hidden input element with information needed when the form is submitted as part of its form processing. Because the content of a Zone can be altered or erased, a hidden field separate from the rest of the enclosing form is established solely for the Zone.

At the same time, Tapestry wants to position the <input> field in a valid location, and HTML defines some constraints for that; an input field must appear inside a <p> or <div> element. If your zone is initially empty, there's no place to put the hidden element, and Tapestry will complain.

The solution is simple: just add a <div> element to the body of the zone. This ensures that there's a place for the hidden input field.  An empty <div> element (even one containing a hidden form field) will not affect page layout.

 

Ques. 16): Why is it necessary for me to provide an interface for my services? Why can't I simply use the class?

Answer:

To begin with, you can do just that, but you will lose some of the functionality provided by Tapestry's IoC container.

The split allows Tapestry to provide functionality around your core service implementation. This is accomplished using proxies: Java classes that implement the service interface, whose methods ultimately invoke the methods of your service implementation.

One of the most important functions of proxies is to encapsulate the service's life cycle: most services are singletons that are created "just in time", meaning the first time a method is invoked. What's going on is that the life-cycle proxy (the object that gets injected into pages, components, or other service implementations) checks on each method invocation to see whether the actual service exists yet. If not, it instantiates and configures it (using proper locking to ensure thread safety), then delegates the method invocation to the service.

If you bind a service class (not a service interface and class), then the service is fully instantiated the first time it is injected, rather than at that first method invocation. Furthermore, you can't apply decorations or method advice to such a service.

The final reason for the service interface / implementation split is to nudge you towards always coding to an interface, which has manifest benefits for code structure, robustness, and testability.
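The interface/implementation split is expressed in a module class; here is a minimal sketch (Indexer and IndexerImpl are hypothetical names):

import org.apache.tapestry5.ioc.ServiceBinder;

public class AppModule
{
    public static void bind(ServiceBinder binder)
    {
        // Injections of Indexer receive a proxy; IndexerImpl itself is only
        // instantiated on the first method invocation.
        binder.bind(Indexer.class, IndexerImpl.class);
    }
}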

 

Ques. 17): How do I make my service startup with the rest of the application, rather than lazily?

Answer:

Tapestry services are designed to be lazy; they are only fully realized when needed: when the first method on the service interface is invoked.

Sometimes a service does extra work that is desirable at application startup: examples include registering message handlers with a JMS implementation, or setting up indexing. Since the service's constructor (or @PostInjection methods) is not invoked until the service is realized, that work would normally be deferred.

The solution is the @EagerLoad annotation; service implementation classes marked with this annotation are loaded when the Registry first starts up, rather than lazily.
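A hedged sketch (the service names here are hypothetical):

import org.apache.tapestry5.ioc.annotations.EagerLoad;

interface MessageHandlerRegistrar { }

@EagerLoad
public class JmsMessageHandlerRegistrar implements MessageHandlerRegistrar
{
    public JmsMessageHandlerRegistrar()
    {
        // Invoked when the Registry starts up rather than at the first
        // method call, so JMS handlers are registered at application launch.
    }
}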

 

Ques. 18): How can I dynamically add new components to an existing page?

Answer:

The short answer is: you don't. The long answer is that you don't need to in order to achieve the desired behaviour.

High scalability is one of Tapestry's core values; it can be expressed in a variety of ways, reflecting scalability concerns both within a single server and across a cluster of servers.

Although you code Tapestry pages and components as if they were ordinary POJOs (Plain Old Java Objects; Tapestry doesn't require you to extend any base classes or implement any special interfaces), when deployed they behave more like traditional servlets: a single instance of each page serves requests from multiple threads. Behind the scenes, this means any incoming request must be handled by that single page instance.

 

Ques. 19): Tapestry enforces the concept of "static structure, dynamic behavior." What does this mean?

Answer:

Beyond simple conditionals and loops, Tapestry offers a variety of options for varying what content is rendered. When rendering a page, you can even "drag in" components from other pages. The idea is that, while the structure of a Tapestry page is quite static, the order in which the page's components render does not have to be top to bottom.

 

Ques. 20): Why do my images and stylesheets end up with weird URLs like /assets/meta/zeea17aee26bc0cae/layout/layout.css?

Answer:

Tapestry doesn't use the servlet container to serve static assets (images, stylesheets, Flash movies, etc.). Instead, Tapestry handles those requests itself and streams the assets to the browser.

The content of the assets will be GZIP-compressed (if the client supports compression and the content is compressible). Tapestry also adds a far-future expires header to the content, so the browser will not ask for the file again, resulting in a significant reduction in network traffic.

The strange hex string is a fingerprint: a hash code computed from the asset's actual content. If the asset changes, it gets a new fingerprint, and therefore a new path and (immutable) resource. This approach, combined with the far-future expires header, ensures that clients aggressively cache assets as they navigate your site, or even between visits.



November 17, 2021

Top 20 Apache NiFi Interview Questions and Answers

  

Ques: 1). Is there a functional overlap between NiFi and Kafka?

Answer: 

This is a pretty typical question, and the two are actually extremely complementary. A Kafka broker provides very low latency when you have a large number of consumers pulling from the same topic. However, Kafka isn't built to solve dataflow problems (think data prioritization and enrichment). Furthermore, Kafka prefers smaller messages, in the KB to MB range, whereas NiFi can handle files of a GB per file or more. NiFi complements Kafka by solving all of those dataflow issues.




Ques: 2). What is Apache NiFi, and how does it work?

Answer: 

Apache NiFi is a dataflow automation and enterprise integration tool that lets you send, receive, route, alter, and modify data as needed, in an automated and configurable way. NiFi can connect many information systems and many types of sources and destinations, such as HTTP, FTP, HDFS, the local file system, and various databases.




Ques: 3). Is NiFi a viable alternative to ETL and batch processing?

Answer: 

For certain use cases, NiFi can likely replace ETL, and it can also be used for batch processing. However, consider the type of processing and transformation the use case requires. NiFi uses FlowFiles to represent the events, objects, and data passing through the flow. While NiFi lets you apply a transformation per FlowFile, you shouldn't use it to join FlowFiles together based on a common column or to perform certain kinds of windowing aggregations. Cloudera advises using additional tools in those situations.

In a streaming use case, the ideal choice is to have NiFi's record processors send the records to one or more Kafka topics. Building on Cloudera's acquisition of Eventador, you can then have Flink perform any processing you want on this data (joining streams or doing windowing operations) using Continuous SQL.

In a batch use case, NiFi would act as an ELT rather than an ETL tool (E = extract, T = transform, L = load). NiFi would collect the various datasets, perform the necessary transformations on each dataset (schema validation, format conversion, data cleansing, and so on), and then send the datasets to a Hive-backed data warehouse. Once the data is there, NiFi could trigger a Hive query to perform the join operation.




Ques: 4). Does NiFi have a master-server architecture?

Answer: 

No; NiFi has followed a zero-master philosophy since NiFi 1.0, and each node in a NiFi cluster performs the same work. ZooKeeper manages the NiFi cluster: it elects a single node to serve as the Cluster Coordinator and handles failover for you. All cluster nodes report heartbeat and status information to the Cluster Coordinator, which is in charge of disconnecting and reconnecting nodes in the cluster. Every cluster also has one Primary Node, likewise elected by ZooKeeper.




Ques: 5). What is the role of Apache NiFi in Big Data Ecosystem?

Answer: 

The main roles Apache NiFi is suited for in the big data ecosystem are:

Data acquisition and delivery.

Transformation of data.

Routing data from different sources to destinations.

Event processing.

End to end provenance.

Edge intelligence and bi-directional communication.




Ques: 6). What are the components of a FlowFile?

Answer: 

A FlowFile has two parts:

Content: the actual stream of bytes transported from source to destination. Keep in mind that the FlowFile itself does not carry this data; it merely holds a pointer to the content, which is stored in NiFi's Content Repository.

Attributes: key-value pairs associated with the data that serve as the FlowFile's metadata. They typically hold values that give context to the data: filename, UUID, MIME type, FlowFile creation time, and so on.




Ques: 7). What exactly is the distinction between MiNiFi and NiFi?

Answer: 

MiNiFi agents are used to collect data from sensors and devices in remote locations. The purpose is to help with the "first mile" of data collection and to obtain data as close to its source as possible.

These devices can include servers, workstations, and laptops, as well as sensors, self-driving cars, factory machinery, and other devices where you want to use MiNiFi's NiFi features to filter, select, and triage the data before transferring it to a destination.

The objective of MiNiFi is to manage this entire process at scale with Edge Flow Manager so the Operations or IT teams can deploy different flow definitions and collect any data as the business requires. Here are some details to consider:

To move data around or collect data from well-known external systems like databases, object stores, and so on, NiFi is designed to be centrally situated, usually in a data centre or in the cloud. In a hybrid cloud architecture, NiFi should be viewed as a gateway for moving data back and forth between diverse environments.

MiNiFi runs on a host, performs some processing and logic, and distributes only the data you care about to external data distribution platforms. Such systems can of course be NiFi, but they can also be MQTT brokers, cloud provider services, and so on. MiNiFi also enables use cases where network bandwidth is limited and the volume of data sent over the network must be reduced.

MiNiFi is available in two flavours: C++ and Java. The MiNiFi C++ option has a modest footprint (a few MB of memory, a small CPU) but a limited number of processors. The MiNiFi Java option is a single-node, lightweight version of NiFi without the user interface or clustering capabilities; it does, however, require Java on the host.




Ques: 8). Once a flow is in place, can processors be scheduled to run like batch jobs?

Answer: 

Apache NiFi is designed around the idea of continuous streaming, so by default processors are scheduled to run continuously once started. We can instead schedule a processor to run periodically, for example hourly or daily. Apache NiFi, however, is not meant to be a job-oriented tool: when we start a processor, it runs all of the time.

 



Ques: 9). What are the main features of NiFi?

Answer: 

The main features of Apache NiFi are.

Highly configurable: Apache NiFi is highly flexible in its configuration and lets us decide between trade-offs such as:

Loss tolerant vs guaranteed delivery

Low latency vs high throughput

Dynamic prioritization

Flow can be modified at runtime

Back pressure

Designed for extension: we can build our own processors, controller services, etc.

Secure:

SSL, SSH, HTTPS, encrypted content, etc.

Multi-tenant authorization and internal authorization/policy management

 



Ques: 10). Is there a NiFi connector for any RDBMS database?

Answer: 

Yes; several processors included in NiFi can communicate with an RDBMS in various ways. For example, "ExecuteSQL" lets you issue a SQL SELECT statement against a configured JDBC connection to retrieve rows from a database; "QueryDatabaseTable" lets you incrementally fetch from a database table; and "GenerateTableFetch" lets you generate queries that fetch not only incrementally but also in parallel against partitions of a source table.

 



Ques: 11). What is the best way to expose a REST API for real-time data collection at scale?

Answer: 

Many users employ NiFi to expose a REST API that external sources can send data to; HTTP is the most widely used protocol.

If you want to ingest data, you would use the ListenHTTP processor in NiFi, which you can configure to listen on a certain port for HTTP requests; any data posted to it is delivered into the flow.

If you wish to implement a web service with NiFi, look at the HandleHttpRequest and HandleHttpResponse processors. Using the two processors together, you receive an HTTP request from an external client and can respond with a customised answer/result based on the data in the request. For example, you can use NiFi to reach remote systems on the client's behalf, such as an FTP server: when NiFi receives a request over HTTP, it runs a query against the FTP server to retrieve the file, which is then returned to the client.

NiFi can handle all of these one-of-a-kind needs with ease. In such a scenario, NiFi scales horizontally to meet demand, and a load balancer placed in front of the NiFi instances distributes the load across the cluster's NiFi nodes.

 



Ques: 12). When NiFi pulls data, do the attributes get added to the content (real data)?

Answer: 

You may absolutely add attributes to your FlowFiles at any time; after all, the purpose of separating the metadata from the actual data is to let you do exactly that. A FlowFile is a representation of an object or message travelling through NiFi. Each FlowFile has a piece of content: the bytes themselves. Attributes can be extracted from the content and stored in memory, and you can then perform operations on those in-memory attributes without having to touch the content at all. This saves a lot of I/O overhead and makes the entire flow-management process much more efficient.

 



Ques: 13). Is it possible for NiFi to link to external sources such as Twitter?

Answer: 

Absolutely. NiFi's architecture is extremely flexible, allowing any developer or user to add a data-source connector quickly. NiFi 1.0 shipped with 170+ processors bundled with the application by default, including the Twitter processor. Future releases will almost certainly add new processors and extensions.

 



Ques: 14). How does NiFi compare with Flume and Sqoop?

Answer: 

NiFi supports all of Flume's use cases and includes the Flume processor out of the box.

Sqoop's features are also supported by NiFi. GenerateTableFetch, for example, is a processor that performs incremental and concurrent fetches against source table partitions.

At the end of the day, the question is whether we're solving a specific, singular use case. If so, any of the tools will suffice. NiFi's benefits really shine when we consider several use cases being handled at once, along with essential flow-management features like interactive, real-time command and control and full data provenance.

 

Ques: 15). What happens to data if NiFi goes down?

Answer: 

As data moves through the system, NiFi stores it in the repository. There are three important repositories:

The flowfile repository.

The content repository.

The provenance repository.

When a processor finishes writing data to a flowfile that is streamed directly to the content repository, it commits the session. This updates the provenance repository to include the events that occurred for that processor, and it also updates the flowfile repository to maintain track of where the file is in the flow. Finally, the flowfile can be moved to the flow's next queue.

NiFi can restart where it left off if it goes down at any point. There is one caveat, however: by default we write updates to the repositories, but the OS frequently caches those writes. If the OS crashes together with NiFi, the cached data may be lost. If we absolutely want to eliminate that caching, we can set the repositories in the nifi.properties file to always sync to disk; this, on the other hand, can be a severe impediment to performance. If only the NiFi process goes down, the data is unaffected, because the OS will still flush the cached writes to disk.
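The relevant settings are the per-repository "always sync" flags in nifi.properties (property names as documented in recent NiFi admin guides; they default to false):

nifi.flowfile.repository.always.sync=true
nifi.content.repository.always.sync=true
nifi.provenance.repository.always.sync=true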

 

Ques: 16). What Is Backpressure In The NiFi System?

Answer: 

Occasionally, the producing system outpaces the consuming system, and messages are consumed more slowly than they arrive. As a result, all unprocessed messages (FlowFiles) accumulate in the connection's buffer. However, you can set a backpressure threshold on the connection, based either on the number of FlowFiles or on the total size of the data. When the threshold is exceeded, the connection applies back pressure to the producing processor, which stops being scheduled to run; no new FlowFiles are generated until the backpressure is relieved.

 

Ques: 17). What Is A Bulletin In NiFi And How Does It Help?

Answer: 

Suppose you want to know whether a dataflow has any issues. You could search the logs for anything interesting, but it is far more convenient to have notifications appear on screen. If a Processor logs anything as a WARNING or ERROR, a "Bulletin Indicator" appears in the top-right-hand corner of that Processor.

This indicator, which resembles a sticky note, remains visible for five minutes after the event occurs. By hovering over the bulletin, the user can see what happened without having to search through log messages. In a cluster, the bulletin also indicates which node emitted it. We can also change the log level at which bulletins occur in the Settings tab of the Processor's Configure dialog.

 


 

Ques: 18). What prioritisation scheme is used if no prioritizers are set in a processor?

Answer: 

The default prioritization scheme is described as "undefined", and it may change over time. If no prioritizers are set, the processor orders the data based on the FlowFiles' Content Claim; this gives the most efficient reading of the data and the highest throughput. We have debated changing the default to First In First Out, but for now we order by what gives the best performance.

 


These are some of the most commonly asked interview questions about Apache NiFi. To learn more about Apache NiFi, check the Apache NiFi category, and subscribe to the newsletter for more related articles.

Top 20 Apache Spark Interview Questions & Answers

  

Ques: 1). What is Apache Spark?

Answer:

Apache Spark is an open-source cluster computing framework for real-time processing. It has a vibrant open-source community and is currently the most active Apache project. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Spark is one of the Apache Software Foundation's most successful projects and has unquestionably established itself as the market leader in big data processing. Many enterprises run Spark on clusters with thousands of nodes; large companies such as Amazon, eBay, and Yahoo! use it today.




Ques: 2). What advantages does Spark have over MapReduce?

Answer:

Compared to MapReduce, Spark has the following advantages:

Spark processes data 10 to 100 times faster than Hadoop MapReduce thanks to the availability of in-memory processing, whereas MapReduce uses persistent storage for all of its data processing activities.

Unlike Hadoop, Spark provides built-in libraries that let it perform a variety of tasks from the same core, including batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, on the other hand, only supports batch processing.

Hadoop is heavily reliant on disk, while Spark encourages caching and in-memory data storage. Spark can perform computations on the same dataset multiple times; this is called iterative computation, and Hadoop offers no such support.
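A small Java sketch of that iterative, in-memory reuse (the file path and filters are illustrative, using the Spark 2.x Java API):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheDemo
{
    public static void main(String[] args)
    {
        SparkConf conf = new SparkConf().setAppName("CacheDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("data.txt").cache(); // kept in memory

        long errors = lines.filter(s -> s.contains("ERROR")).count(); // first action materialises the cache
        long warnings = lines.filter(s -> s.contains("WARN")).count(); // second action reuses it, no re-read

        System.out.println(errors + " errors, " + warnings + " warnings");
        sc.close();
    }
}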


Ques: 3). What exactly is YARN?

Answer:

YARN (Yet Another Resource Negotiator) is Hadoop's central resource management platform, delivering scalable operations across the cluster, and Spark can run on it just as Hadoop MapReduce can. YARN, like Mesos, is a distributed resource manager, whereas Spark is a data processing tool. Running Spark on YARN requires a Spark binary distribution built with YARN support.


Ques: 4). What's the difference between an RDD, a Dataframe, and a Dataset?

Answer:

Resilient Distributed Dataset (RDD) - The most basic data structure in Spark: an immutable collection of records partitioned across the nodes of a cluster. It allows us to perform fault-tolerant, in-memory computations on large clusters.

Unlike DataFrames and Datasets, an RDD does not keep a schema; it only stores data. To apply a schema to an RDD, the user must first build a case class and then map the data into it.

We will use RDD for the below cases:

-When our data is unstructured, such as streams of text or media.

-When we don't want to impose any schema.

-When we don't care about column-name attributes while processing or accessing the data.

-When we want to manipulate the data with functional programming constructs rather than domain-specific expressions.

-When we want low-level transformations, actions, and control over the dataset.

DataFrame:

-Like RDDs, DataFrames are immutable collections of data.

-Unlike RDDs, DataFrames carry a schema for their data, letting users easily access and process large datasets distributed across the nodes of a cluster.

-DataFrames provide a domain-specific language API for manipulating distributed data, making Spark accessible to a much wider audience than specialized data engineers.

-From Spark 2.x onward, a DataFrame is simply Dataset[Row] (the untyped API): an alias for a collection of generic objects, where a Row is a generic, untyped JVM object.

DataSet:

A Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java. A case class can likewise be applied to an RDD to use it as a Dataset[T].
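In Java terms, the three APIs relate roughly like this (a hedged sketch; Person is a hypothetical bean matching the JSON file):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ApiComparison
{
    public static void main(String[] args)
    {
        SparkSession spark = SparkSession.builder()
                .appName("ApiComparison").master("local[*]").getOrCreate();

        Dataset<Row> df = spark.read().json("people.json");      // DataFrame == Dataset<Row>
        Dataset<Person> ds = df.as(Encoders.bean(Person.class)); // strongly-typed Dataset
        JavaRDD<Person> rdd = ds.javaRDD();                      // drop down to the RDD API

        spark.stop();
    }

    public static class Person // hypothetical bean with name/age fields
    {
        private String name;
        private long age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getAge() { return age; }
        public void setAge(long age) { this.age = age; }
    }
}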


Ques: 5). Can you explain how you can use Apache Spark along with Hadoop?

Answer:

Apache Spark provides the benefit of being Hadoop compatible. They make a powerful tech team when they work together. Using Apache Spark and Hadoop combines the processing capability of Spark with the best capabilities of HDFS and YARN. The following are some examples of how to use Hadoop Components with Apache Spark:

Batch & Real-Time Processing – MapReduce and Spark can work together, where the former handles batch processing, and the latter handles real-time processing.

HDFS – Spark can make use of the HDFS to leverage the distributed replicated storage.

MapReduce – Apache Spark can be used in conjunction with MapReduce in a similar Hadoop cluster or independently as a processing framework.

YARN – You can run Spark applications on YARN.


Ques: 6). What is the meaning of the term "spark lineage"?

Answer:

• In Spark, all dependencies between RDDs are recorded in a graph, independent of the actual data. This is referred to as a lineage graph.

• RDD stands for Resilient Distributed Dataset, where "resilient" refers to fault tolerance: using the RDD lineage graph, Spark can recompute a missing or damaged partition after a node failure. When we derive new RDDs from existing ones, the lineage graph tracks the dependencies; each RDD keeps a pointer to one or more parents, together with metadata describing the type of relationship it has with each parent.

• An RDD's lineage graph in Spark can be obtained using the toDebugString method.
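For example (assuming a JavaSparkContext sc, as in the earlier sketch):

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<String> words = lines.flatMap(s -> java.util.Arrays.asList(s.split(" ")).iterator());
System.out.println(words.toDebugString()); // prints the RDD's lineage graph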

 

Ques: 7). List the various components of the Spark Ecosystem.

Answer:

These are the five types of components in the Spark Ecosystem:

GraphX: Enables graphs and graph-parallel computation.

MLlib: It is used for machine learning.

Spark Core: A powerful parallel and distributed processing platform.

Spark Streaming: Handles real-time streaming data.

Spark SQL: Combines Spark's functional programming API with relational processing.

 

Ques: 8). What is RDD in Spark? Explain it.

Answer:

RDD stands for Resilient Distributed Dataset. RDDs are Spark's core data structure: fault-tolerant, immutable, partitioned collections of records distributed among the cluster nodes.

There are two ways to construct RDDs: parallelizing an existing collection, and referencing an external dataset. RDDs are lazily evaluated, and this lazy evaluation is part of what makes Spark's processing so fast.
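The two construction routes, in Java (again assuming a JavaSparkContext sc):

JavaRDD<Integer> fromCollection = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4)); // parallelize a collection
JavaRDD<String> fromStorage = sc.textFile("hdfs:///data/input.txt");                   // reference an external dataset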

 

Ques: 9). In Spark, how does streaming work?

Answer:

Spark Streaming receives data in real time and divides it into batches. The Spark engine processes these batches of data, and the final stream of results is returned, also in batches. DStream, or Discretized Stream, is the most basic stream abstraction in Spark.
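A minimal Java sketch of this batching model (the socket source and one-second batch interval are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        SparkConf conf = new SparkConf().setAppName("StreamingDemo").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1)); // 1-second batches

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
        words.countByValue().print(); // results come back batch by batch

        jssc.start();
        jssc.awaitTermination();
    }
}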

 

Ques: 10). Is it feasible to access and analyse data stored in Cassandra databases using Apache Spark?

Answer: 

Yes, Apache Spark can be used to retrieve and analyse data stored in Cassandra databases, via the Spark Cassandra Connector. The connector lets Spark executors communicate with local Cassandra nodes and request only local data.

Connecting Cassandra and Apache Spark this way speeds up queries by reducing network traffic between the Spark executors and the Cassandra nodes.

 

Ques: 11). What are the advantages of using Spark SQL?

Answer:

Spark SQL carries out the following tasks:

Loads data from a variety of structured data sources, such as relational database management systems (RDBMS).

Queries data using SQL statements, both within a Spark program and via JDBC/ODBC connectors from third-party tools such as Tableau.

Provides interoperation between SQL and Python/Scala code.

 

Ques: 12). What is the purpose of Spark Executor?

Answer:

Executors are launched on the worker nodes of a cluster when a SparkContext is created. Spark executors are in charge of performing computations and storing data on the worker nodes; they are also responsible for returning the results to the driver.

 

Ques: 13). What are the advantages and disadvantages of Spark?

Answer:

Advantages: Spark is known for real-time data processing, which may be employed in applications such as stock market analysis, finance, and telecommunications.

Spark's stream processing allows real-time data analysis, which can aid in fraud detection, system alarms, and other applications.

Thanks to its lazy evaluation mechanism and parallel processing, Spark processes data 10 to 100 times faster.

Disadvantages: When compared to Hadoop, Spark consumes more storage space.

Work is distributed over numerous clusters rather than taking place on a single node.

Spark's in-memory processing can be costly when dealing with large amounts of data.

Compared to Hadoop, Spark also makes heavier demands on memory.

 

Ques: 14). What are some of the drawbacks of utilising Apache Spark?

Answer:

The following are some of the drawbacks of utilising Apache Spark:

There is no built-in file management system; integration with another platform, such as Hadoop, is needed to take advantage of one.

Higher latency for streaming workloads, since data is processed in micro-batches rather than record by record.

It does not support true real-time stream processing: in Apache Spark, live data streams are partitioned into batches, processed, and turned back into batches. In other words, Spark Streaming is micro-batch processing rather than true real-time processing.

Fewer algorithms are available.

Record-based window criteria are not supported by Spark Streaming, and work must be distributed across multiple clusters instead of running everything on a single node.

Apache Spark's reliance on in-memory capacity becomes a bottleneck for cost-efficient processing of big data.

 

Ques: 15). Is Apache Spark compatible with Apache Mesos?

Answer:

Yes. Spark can run on clusters managed by Apache Mesos, just as it runs on YARN-managed clusters. Spark can also run without any resource manager in standalone mode; when it has to execute on multiple nodes, it can use YARN or Mesos.

 

Ques: 16). What are broadcast variables, and how do they work?

Answer:

Accumulators and broadcast variables are the two types of shared variables in Spark. Instead of being shipped back and forth to the driver, broadcast variables are read-only variables cached in the executors for local reference. A broadcast variable keeps a read-only cached copy of the variable on each machine rather than delivering a copy of the variable with every task.

Broadcast variables are commonly used to distribute a copy of a large input dataset to each node. To cut transmission costs, Apache Spark distributes broadcast variables using efficient broadcast algorithms.

With broadcast variables, there is no need to replicate a variable for each task, so data can be processed quickly. In contrast to an RDD lookup(), broadcast variables let you keep a lookup table in memory, improving retrieval efficiency.
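A minimal sketch (assuming a JavaSparkContext sc; the lookup data is illustrative, and imports from org.apache.spark.broadcast and java.util are omitted):

Broadcast<Map<String, String>> lookup = sc.broadcast(Collections.singletonMap("IN", "India")); // shipped once per executor
JavaRDD<String> codes = sc.parallelize(Arrays.asList("IN", "US", "IN"));
JavaRDD<String> names = codes.map(c -> lookup.value().getOrDefault(c, "unknown"));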

 

Ques: 17). In Apache Spark, how does caching work?

Answer:

Caching RDDs in Spark speeds up processing by allowing multiple accesses to the same RDD. In Spark Streaming, Discretized Streams (DStreams) likewise allow users to cache or persist the stream's data in memory.

The cache() function stores data in memory, while persist(level) caches it at whatever storage level is specified.

persist() without a level specifier is the same as cache(): it caches the data in memory. persist(level) caches data at the given storage level, for example in memory, on disk, or in off-heap memory.

 

Ques: 18).  What exactly is Akka? What does Spark do with it?

Answer:

Akka is a Scala and Java framework for building reactive, distributed, parallel, and resilient concurrent applications.

Spark has historically used Akka for job scheduling and for messaging between the master and the worker nodes when assigning tasks. (More recent Spark releases replaced Akka with their own RPC layer.)

 

Ques: 19). What applications do you utilise Spark streaming for?

Answer:

Spark Streaming is employed when real-time data must be streamed into a Spark program. The data can be streamed from a variety of sources, such as Kafka, Flume, and Amazon Kinesis, and is divided into batches for processing.

Spark Streaming is used to conduct real-time sentiment analysis of customers on social media sites such as Twitter and Facebook.

Live streaming data processing is critical for detecting outages, detecting fraud in financial institutions, and making stock market predictions, among other things.

 

Ques: 20). What exactly do you mean when you say "lazy evaluation"?

Answer:

Spark is deliberate in the way it operates on data: when you ask Spark to perform a task on a dataset, it records your instructions rather than carrying them out immediately, and does nothing until explicitly asked. When map() is invoked on an RDD, the operation is not performed right away; Spark doesn't evaluate transformations until you trigger them with an action. This helps optimize the overall data-processing workflow.
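A small sketch (assuming a JavaSparkContext sc):

JavaRDD<Integer> nums = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4));
JavaRDD<Integer> doubled = nums.map(n -> n * 2); // transformation: recorded, not executed
long total = doubled.count();                    // action: triggers the actual computation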




Top 20 Apache Hive Interview Questions & Answers


Ques: 1). What is Apache Hive, and how does it work?

Answer:

Apache Hive is a data warehouse project built on top of Hadoop. The platform focuses on data analysis and includes data query capabilities: Hive provides a SQL-like interface (HiveQL) for querying data stored in files and database systems that integrate with Hadoop, and it is a popular data analysis and querying technology used by Fortune 500 companies around the world. When it is cumbersome or inefficient to express the logic in HiveQL, Hive allows traditional MapReduce programmers to plug in custom mappers and reducers (user-defined functions, UDFs).

 



Ques: 2). What is the purpose of Hive?

Answer:

Hive is a Hadoop tool that lets you organise and query data in a database-like fashion and write SQL-like queries. It can be used to access and analyse data stored in Hadoop using SQL syntax.


Ques: 3). What are the differences between local and remote meta stores?

Answer:

Local metastore: in the local metastore configuration, the metastore service runs in the same Java Virtual Machine (JVM) as the Hive service and connects to a database running in a separate JVM, either on the same machine or on a remote machine.

Remote metastore: in the remote metastore configuration, the metastore service and the Hive service run in separate JVMs, and all other processes communicate with the metastore server using the Thrift network API. You can run several metastore servers in this mode for high availability.


Ques: 4). Explain the core difference between the external and managed tables?

Answer:

The fundamental distinctions between managed and external tables are:

When a managed table is dropped, the table's metadata and its data are both deleted. When an external table is dropped, Hive deletes only the metadata associated with the table and leaves the table data in HDFS untouched.

By default, when you create a table, Hive manages the data, meaning it moves the data into its warehouse directory. Alternatively, you can create an external table, which tells Hive to refer to data stored somewhere outside the warehouse directory.

The semantics of LOAD and DROP reveal the difference between the two table types: data loaded into a managed table is stored in Hive's warehouse directory.
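As an illustration, here is a hedged Java sketch using the Hive JDBC driver (the connection URL, table name, and path are hypothetical, and the driver must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ExternalTableDemo
{
    public static void main(String[] args) throws Exception
    {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement st = conn.createStatement())
        {
            st.execute("CREATE EXTERNAL TABLE ext_logs (msg STRING) LOCATION '/data/logs'");
            st.execute("DROP TABLE ext_logs"); // removes only the metadata; files under /data/logs remain
        }
    }
}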


Ques: 5). What is the difference between schema on read and schema on write?

Answer:

In a conventional database, a table's schema is enforced at data load time: if the data being loaded does not conform to the schema, it is rejected. This design is often called schema on write, because the data is checked against the schema when it is written into the database.

Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a query is issued. This is referred to as schema on read.

There are trade-offs between the two approaches. Schema on read makes the initial load very fast, because the data does not have to be read, parsed, and serialized to disk in the database's internal format; the load operation is just a file copy or move. It is also more flexible: consider having two schemas for the same underlying data, depending on the analysis. (Hive's external tables can be used for this.)

Schema on write makes query-time performance faster, because the database can index columns and compress the data. The trade-off is that it takes longer to load data into the database. Furthermore, in many cases the schema is not known at load time, so no indexes can be applied, because the queries have not yet been formulated. These are the situations where Hive really shines.


Ques: 6). Write a query to add a new column. Can you add a column with a default value in Hive?

Answer:

ALTER TABLE test1 ADD COLUMNS (access_count1 int);

Note that you cannot add a column with a default value in Hive. Adding the column has no effect on the files that back your table; Hive interprets NULL as the value of every cell in that column in order to deal with the "missing" data.

In Hive, you must effectively recreate the entire table, this time with the column filled in. It's possible that rerunning your original query with the additional column will be easier. Alternatively, you might add the column to the table you already have, then select all of its columns plus the new column's value.


Ques: 7). What is the purpose of Hive's DISTRIBUTED BY clause?

Answer:

DISTRIBUTE BY determines how map output is divided among the reducers. By default, MapReduce computes a hash of the keys output by the mappers and uses the hash values to distribute the key-value pairs evenly across the available reducers. Suppose we want all of the rows for each value of a column to be collected together; we can use DISTRIBUTE BY to ensure that the records for each value reach the same reducer. DISTRIBUTE BY controls how reducers receive rows for processing, in the same way that GROUP BY does.

If the DISTRIBUTE BY and SORT BY clauses are in the same query, Hive expects the DISTRIBUTE BY clause to come before the SORT BY clause. DISTRIBUTE BY is also a helpful workaround for memory-intensive jobs, because it requires Hadoop to employ reducers instead of producing a map-only job: mappers group data based on the specified DISTRIBUTE BY columns, reducing the framework's overall workload, and then hand those groups to the reducers.
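For instance (assuming an open java.sql.Statement st on a HiveServer2 connection, as in the earlier JDBC sketch; the table and column names are hypothetical):

// all rows for a given user_id go to the same reducer and are sorted within it
java.sql.ResultSet rs = st.executeQuery(
    "SELECT user_id, action FROM events DISTRIBUTE BY user_id SORT BY user_id");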

 

Ques: 8). What happens when you run a query in Hive?

Answer:

The query planner examines the query and converts it into a DAG (Directed Acyclic Graph) of Hadoop MapReduce jobs.

The jobs are submitted to the Hadoop cluster in the order the DAG implies.

For simple queries, only mappers are used. The InputFormat is in charge of splitting the input and reading data from HDFS. The data is then handed to a layer called the SerDe (Serializer/Deserializer); in this case the deserializer half of the SerDe converts the data from a byte stream to a structured format.

For aggregate queries, the MapReduce jobs will also include reducers. In this case the serializer half of the SerDe converts structured data to a byte stream, which is handed to the OutputFormat, which writes it to HDFS.

 

Ques: 9). What is the importance of STREAMTABLE?

Answer:

Joins are useful when you need information from several tables, but when one table holds 1.5 billion or more records and you want to join it to a master table, the order of the joined tables is crucial.

Consider the following scenario: 

select foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a; 

Hive streams the right-most table (bar) and buffers the other tables (foo) in memory before performing the map-side/reduce-side join. So if you buffer 1.5 billion or more records, your join query will fail, since 1.5 billion records will almost certainly fill up the Java heap and raise an out-of-memory exception.

So, to overcome this limitation and free the user from having to remember the order of joined tables based on their record counts, Hive provides the keyword hint /*+ STREAMTABLE(foo) */, which tells the Hive analyzer to stream table foo:

select /*+ STREAMTABLE(foo) */ foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a;

This way, the user doesn't have to remember the order of the joined tables.

 

Ques: 10). When is it appropriate to use SORT BY instead of ORDER BY?

Answer:

When working with huge volumes of data in Apache Hive, we use SORT BY instead of ORDER BY. One reason is that SORT BY uses multiple reducers, which cuts down the time the query takes to complete. ORDER BY, on the other hand, uses a single reducer to guarantee a total order over the output, so it takes longer than usual to finish.

 

Ques: 11). What is the purpose of Hive's Partitioning function?

Answer:

Partitioning lets users arrange data in a Hive table the way they want it. As a result, the system can scan only the relevant data rather than the complete dataset.

Consider the following scenario: assume we have transaction log data from a business website for years such as 2018, 2019, and 2020. Using the partition key, a query for a given year, say 2019, can skip the 2018 and 2020 partitions entirely, reducing the amount of data scanned.
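Via the Hive JDBC driver, creating and querying such a year-partitioned table might look like this (again assuming an open Statement st, as in the earlier sketch; names are hypothetical):

st.execute("CREATE TABLE txn_logs (txn_id STRING, amount DOUBLE) PARTITIONED BY (year INT)");
// a query restricted to one partition scans only that year's data
java.sql.ResultSet rs = st.executeQuery("SELECT * FROM txn_logs WHERE year = 2019");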

 

Ques: 12). What is dynamic partitioning and how does it work?

Answer:

In dynamic partitioning, the values of the partition columns are known only at runtime, i.e., when the data is loaded into the Hive table. A common use of dynamic partitioning is:

Moving data from a non-partitioned table into a partitioned table, which reduces query latency and improves sampling.

 

Ques: 13). In Hive, what's the difference between dynamic and static partitioning?

Answer:

Hive partitioning is highly beneficial for pruning data during queries in order to reduce query times.

When data is inserted into a table, partitions are produced. Partitions are required depending on how data is loaded. When loading files (especially large files) into Hive tables, static partitions are usually preferred. When compared to dynamic partition, this saves time when loading data. You "statically" create a partition in the table and then move the file into that partition. 

Because the files are large, they are typically created on HDFS. Without reading the entire large file, you can retrieve the partition column value from the filename, date, and so on. In the case of dynamic partitioning, the entire large file is read, i.e. every row of data is read, and the data is partitioned into the target tables using an MR job based on specified fields in the file.

Dynamic partitions are typically handy when doing an ETL operation in your data pipeline. For example, suppose you use the transfer command to load a large file into Table X. Then you run an idle query into Table Y and split data based on table X fields such as day and country. You could wish to execute an ETL step to partition the data in Table Y's nation partition into a Table Z where the data is partitioned based on cities for a specific country alone, and so on.

Thus, depending on your target table, your requirements for the data, and the form in which the data arrives from the source, you may choose static or dynamic partitioning; a static-partition load is sketched below.
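For contrast with the dynamic example above, a static-partition load states the partition value explicitly (the path and table name here are illustrative):

-- static partition: the partition value is given in the statement,
-- so Hive never has to read the file to decide where the rows go
load data inpath '/data/logs/txn_2019.csv'
into table transaction_logs partition (txn_year = '2019');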

 

Ques: 14).What is ObjectInspector in Hive?

Answer:

The ObjectInspector is a feature that allows us to analyse the internal structure of a row object and of individual columns in Hive. It also provides a uniform way to access complex objects that can be stored in several formats in memory, such as:

  • A standard Java object
  • An instance of the Java class
  • A lazily initialized object

The ObjectInspector lets the users know the structure of an object and also helps in accessing the internal fields of an object.

 

Ques: 15). How does impala outperform hive in terms of query response time?

Answer:

Impala should be thought of as "SQL on HDFS," whereas Hive is more "SQL on Hadoop."

In other words, Impala does not depend on MapReduce at all. It simply runs daemons on all of your nodes that cache some of the data stored in HDFS, allowing these daemons to return data rapidly without having to run a full Map/Reduce job.

The rationale for this is that running a Map/Reduce operation has some overhead, so short-circuiting Map/Reduce completely can result in a significant reduction in runtime.

That said, Impala is not a replacement for Hive; each is useful in different situations. Impala does not support fault tolerance the way Hive does, so if a node fails during your query, the query must be rerun. I would recommend Hive for ETL processes, where a single job failure would be costly, but Impala can be great for small ad-hoc queries, such as for data scientists or business analysts who just want to look at and study some data without having to develop substantial jobs.

 

Ques: 16). Explain the different components used in the Hive Query processor?

Answer:

The main components of the Hive query processor are listed below:

  • Metadata Layer (ql/metadata)
  • Parse and Semantic Analysis (ql/parse)
  • Map/Reduce Execution Engine (ql/exec)
  • Sessions (ql/session)
  • Type Interfaces (ql/typeinfo)
  • Tools (ql/tools)
  • Hive Function Framework (ql/udf)
  • Plan Components (ql/plan)
  • Optimizer (ql/optimizer)

 

Ques: 17). What is the difference between Hadoop Buffering and Hadoop Streaming?

Answer:

Hadoop streaming means implementing your map-reduce logic with custom Python or shell scripts (invoked from Hive, for example, via the TRANSFORM clause).

In this context, Hadoop buffering refers to the phase in the map-reduce job of a Hive join query when records are read into the reducers after being sorted and grouped by the mappers. Because the reducers buffer every table except the last one in memory, you should order the join clauses in a Hive query so that the largest table comes last; this helps Hive implement joins more efficiently.
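A sketch of the streaming side, where my_script.py is a hypothetical user script that reads tab-separated rows on stdin and writes tab-separated rows on stdout, and page_views is an illustrative table:

-- ship the script to the cluster, then pipe rows through it
add file /scripts/my_script.py;

select transform (user_id, page_url)
using 'python my_script.py'
as (user_id, normalized_url)
from page_views;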

 

Ques: 18). How will the work be optimised by the map-side join?

Answer:

Suppose we have two tables, one of which is small. Before the original join Map/Reduce task runs, a local Map/Reduce job reads the small table's data from HDFS into an in-memory hash table, and then serialises that in-memory hash table into a hash table file.

In the next stage, while the original join Map/Reduce job is running, the hash table file is moved to the Hadoop distributed cache, which populates the file onto each mapper's local disc. As a result, every mapper can load this hash table file back into memory and perform the join in the map phase, as before.

After optimisation, the small table needs to be read only once. In addition, if several mappers run on the same machine, the distributed cache only needs to send a single copy of the hash table file to that machine.

Advantages of using Map-side join:

Using a map-side join removes the cost of sorting and merging data in the shuffle and reduce stages, which reduces the time it takes to complete the task.

Disadvantages of Map-side join:

It is only suitable for use when one of the tables on which the map-side join operation is performed is small enough to fit into memory. As a result, performing a map-side join on tables with a lot of data in each of them isn't a good idea.
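A minimal sketch of triggering a map-side join, with dim_country standing in for a hypothetical small dimension table:

-- explicit hint: build the in-memory hash table from dim_country
select /*+ MAPJOIN(c) */ f.txn_id, c.country_name
from fact_txns f join dim_country c on f.country_code = c.country_code;

-- alternatively, let Hive convert joins automatically whenever one side
-- falls below hive.mapjoin.smalltable.filesize
set hive.auto.convert.join = true;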

 

Ques: 19).What type of user defined functions exists in HIVE?

Answer:

Hive supports three kinds of user-defined functions: UDFs, UDAFs, and UDTFs.

A (plain) UDF operates on a single row and produces a single row as its output. Most functions, such as mathematical functions and string functions, are of this type.

A UDF must satisfy the following two properties:

  • A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
  • A UDF must implement at least one evaluate() method.
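Once compiled into a JAR, a custom UDF is registered and called from HiveQL like this (the JAR path, class name, and employees table are hypothetical):

add jar /jars/my-udfs.jar;
create temporary function my_lower as 'com.example.hive.MyLowerUDF';

select my_lower(name) from employees;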

 

A UDAF works on multiple input rows and creates a single output row. Aggregate functions, such as COUNT and MAX, are of this type.

A UDAF must satisfy the following two properties:

  • A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF.
  • Its evaluator must implement five methods:
    • init()
    • iterate()
    • terminatePartial()
    • merge()
    • terminate()

A UDTF operates on a single row and produces multiple rows — a table — as output. A UDTF must be a subclass of org.apache.hadoop.hive.ql.udf.generic.GenericUDTF; a custom UDTF is created by extending the GenericUDTF abstract class and implementing the initialize, process, and possibly close methods.

  • The initialize method is called by Hive to notify the UDTF of the argument types to expect; the UDTF must then return an object inspector corresponding to the row objects it will generate.
  • Once initialize() has been called, Hive gives rows to the UDTF through the process() method. While in process(), the UDTF can produce and forward rows to other operators by calling forward().
  • Lastly, Hive calls the close() method when all the rows have been passed to the UDTF.
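Hive's built-in explode() is an example of a UDTF; with a hypothetical students table whose subjects column is an array, it turns each input row into one output row per array element:

select name, subject
from students
lateral view explode(subjects) t as subject;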

 

Ques: 20). Is the HIVE LIMIT clause truly random?

Answer:

Although the manual claims that LIMIT returns rows at random, this is not really the case. Without any WHERE or ORDER BY clause, it returns "selected rows at random" only in the sense that they come back as they occur in the underlying data. That does not mean the rows are truly random (or randomly picked); it only means the order in which they are returned cannot be predicted.

As soon as you add an ORDER BY x DESC LIMIT 5, it returns the last 5 rows of whatever you are selecting from. To get truly random rows, you would have to use something like ORDER BY rand() LIMIT 1.

However, that can slow things down on a huge dataset, since every row must be read and sorted. I usually do a min/max to get the IDs on the table, pick a random number between them, and then select those records (in your case, just one), which is usually faster than letting the database do the work.
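Putting this together (my_table and its columns are illustrative):

-- deterministic: the 5 highest values of x
select * from my_table order by x desc limit 5;

-- truly random single row (simple, but scans and sorts everything)
select * from my_table order by rand() limit 1;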



Top 20 Apache Ambari Interview Questions & Answers

  

Ques: 1). Describe Apache Ambari's main characteristics.

Answer:

Apache Ambari is an Apache product created with the goal of making Hadoop applications easier to manage. Ambari assists in the management of Hadoop projects through:

  • Simple provisioning
  • Simple project management
  • Monitoring of Hadoop clusters
  • A user-friendly interface
  • A web UI for Hadoop management
  • RESTful API support

 Apache Tapestry Interview Questions and Answers

Ques: 2). Why do you believe Apache Ambari has a bright future?

Answer:

With the growing need for big data technologies like Hadoop, we've witnessed a surge in data analysis, resulting in gigantic clusters. Companies are turning to technologies like Apache Ambari for better cluster management, increased operational efficiency, and increased visibility. Furthermore, we've noted how HortonWorks, a technology titan, is working on Ambari to make it more scalable. As a result, learning Hadoop as well as technologies like Apache Ambari is advantageous.

 Apache NiFi Interview Questions & Answers

Ques: 3). What are the core benefits for Hadoop users by using Apache Ambari?

Answer: 

Apache Ambari is a great gift for individuals who use Hadoop in their day-to-day work. With Ambari, Hadoop users get these core benefits:

1. The installation process is simplified
2. Configuration and overall management are simplified
3. It has a centralised security setup process
4. It gives full visibility into cluster health
5. It is extensively extensible, with an option to customise if needed

 Apache Spark Interview Questions & Answers

Ques: 4). What Are The Checks That Should Be Done Before Deploying A Hadoop Instance?

Answer:

Before actually deploying the Hadoop instance, the following checklist should be completed:

  • Check for existing installations
  • Set up passwordless SSH
  • Enable NTP on the clusters
  • Check for DNS
  • Disable the SELinux
  • Disable iptables

 Apache Hive Interview Questions & Answers

Ques: 5). As a Hadoop user or system administrator, why should you choose Apache Ambari?

Answer:

Using Apache Ambari can provide a Hadoop user with a number of advantages. A system administrator can use Ambari to:

  • Install Hadoop across any number of hosts, using the step-by-step guide supplied by Ambari, while Ambari handles the Hadoop installation configuration.
  • Centrally administer Hadoop services across the cluster.
  • Efficiently monitor the state and health of a Hadoop cluster using the Ambari metrics system. Furthermore, the Ambari alert framework sends out timely notifications for any system difficulties, such as disc space issues or node outages.

 

Ques: 6). Can you explain Apache Ambari architecture?

Answer:

Apache Ambari consists of following major components-

  • Ambari Server
  • Ambari Agent
  • Ambari Web

Apache Ambari Architecture

All metadata is handled by the Ambari server, which stores it in a Postgres database instance. The Ambari agent is installed on each computer in the cluster, and the Ambari server manages each host through it.

The Ambari agent runs on each host and delivers heartbeats from the node to the Ambari server, along with various operational metrics that are used to determine the nodes' health.

Ambari Web UI is a client-side JavaScript application that performs cluster operations by regularly accessing the Ambari RESTful API. Furthermore, using the RESTful API, it facilitates asynchronous communication between the application and the server.

 

Ques: 7). Apache Ambari supports how many layers of Hadoop components, and what are they?

Answer: 

Apache Ambari supports three tiers of Hadoop components, which are as follows:

1. Hadoop core components

  • Hadoop Distributed File System (HDFS)
  • MapReduce

2. Essential Hadoop components

  • Apache Pig
  • Apache Hive
  • Apache HCatalog
  • WebHCat
  • Apache HBase
  • Apache ZooKeeper

3. Hadoop support components

  • Apache Oozie
  • Apache Sqoop
  • Ganglia
  • Nagios

 

Ques: 8). What different sorts of Ambari repositories are there?

Answer: 

Ambari Repositories are divided into four categories, as below:

  1. Ambari: Ambari server, monitoring software packages, and Ambari agent are all stored in this repository.
  2. HDP-UTILS: The Ambari and HDP utility packages are stored in this repository.
  3. HDP: Hadoop Stack packages are stored in this repository.
  4. EPEL (Extra Packages for Enterprise Linux): An additional set of software packages for Enterprise Linux.

 

Ques: 9). How can I manually set up a local repository?

Answer:

When there is no active internet connection available, this technique is used. Please follow the instructions below to create a local repository:

1. First and foremost, create an Apache httpd host.
2. Download a Tarball copy of each repository's entire contents.
3. After it has been downloaded, the contents must be extracted.

 

Ques: 10). What is a local repository, and when are you going to utilise one?

Answer:

A local repository is a hosted place for Ambari software packages in the local environment. When the enterprise clusters have no or limited outbound Internet access, this is the method of choice.

 

Ques: 11). What are the benefits of setting up a local repository?

Answer: 

First and foremost, by setting up a local repository you can access Ambari software packages without internet access. Along with that, you can achieve benefits like:

  • Enhanced governance with better installation performance
  • Faster routine post-installation cluster operations, such as service start and restart

 

Ques: 12). What are the new additions in Ambari 2.6 versions?

Answer:

Ambari 2.6.2 added the following features:

  • Protection of Zeppelin Notebook SSL credentials
  • The ability to set appropriate HTTP headers to use Cloud Object Stores with HDP

Ambari 2.6.1 added the following feature:

  • Conditional installation of LZO packages through Ambari

Ambari 2.6.0 added the following features:

  • Distributed mode of the Ambari Metrics System (AMS), along with multiple collectors
  • Host recovery improvements for restarts, moving masters with minimum impact, and scale testing
  • Improvements in data archival & purging in Ambari Infra

 

Ques: 13). List Out The Commands That Are Used To Start, Check The Progress And Stop The Ambari Server?

Answer :

The following are the commands that are used to do the following activities:

To start the Ambari server

ambari-server start

To check the Ambari server processes

ps -ef | grep Ambari

To stop the Ambari server

ambari-server stop

 

Ques: 14). What all tasks you can perform for managing host using Ambari host tab?

Answer: 

Using Hosts tab, we can perform the following tasks:

  • Analysing Host Status
  • Searching the Hosts Page
  • Performing Host related Actions
  • Managing Host Components
  • Decommissioning a Master node or Slave node
  • Deleting a Component
  • Setting up Maintenance Mode
  • Adding or removing Hosts to a Cluster
  • Establishing Rack Awareness

 

Ques: 15). What all tasks you can perform for managing services using Ambari service tab?

Answer: 

Using Services tab, we can perform the following tasks:

  • Start and Stop of All Services
  • Display of Service Operating Summary
  • Adding a Service
  • Configuration Settings change
  • Performing Service Actions
  • Rolling Restarts
  • Background Operations monitoring
  • Service removal
  • Auditing operations
  • Using Quick Links
  • YARN Capacity Scheduler refresh
  • HDFS management
  • Atlas management in a Storm Environment

 

Ques: 16). Is there a relationship between the amount of free RAM and disc space required and the number of HDP cluster nodes?

Answer: 

Without a doubt, it does. The amount of RAM and disc space required depends on the number of nodes in your cluster. Typically, a cluster of up to 10 nodes requires 1 GB of memory and 10 GB of disc space on the Ambari host; similarly, a 100-node cluster requires 4 GB of memory and 100 GB of disc space. To get the exact details, you will need to check the documentation for your specific version.

 

 

Ques: 18). What is the best method for installing the Ambari agent on all 1000 hosts in the HDP cluster?

Answer: 

Because the cluster contains 1000 nodes, we should not manually install the Ambari agent on each node. Instead, we should set up a password-less ssh connection between the Ambari host and all of the cluster's nodes. To remotely access and install the Ambari Agent, Ambari Server hosts employ SSH public key authentication.

 

Ques: 19). What can I do if I have admin capabilities in Ambari?

Answer: 

Being a Hadoop administrator is a demanding job. If you are an Ambari admin, you can create a cluster, manage the users in that cluster, and create groups. All of these permissions are granted to the default admin user, and as an Ambari administrator you can grant the same or different permissions to another user.

 

Ques: 20).  How is recovery achieved in Ambari?

Answer:

Recovery happens in Ambari in the following ways:

Based on actions:

After a restart, the master checks for pending actions and reschedules them, since every action is persisted. The master also rebuilds the state machines after a restart, as the cluster state is persisted in the database. There is a race condition in which actions may complete while the master is recording their completion, so the actions should be idempotent; this is a special consideration here. The master restarts any action that is not marked as completed or as failed in the database. These persisted actions can be seen in the redo logs.

Based on the desired state:

The master persists the desired state of the cluster, and after a restart it attempts to bring the cluster back to that desired state.