
April 21, 2022

Top 20 AWS DynamoDB Interview Questions and Answers


        Amazon DynamoDB is a proprietary NoSQL database service from Amazon Web Services. It handles key-value and document data structures and uses synchronous replication across multiple data centres, which gives DynamoDB a high level of durability and availability. Users can also take advantage of Auto Scaling: when enabled, DynamoDB scales the database automatically. DynamoDB's primary data structures are hash tables and B-trees; data is first distributed into multiple partitions by hashing the partition key. DynamoDB offers SDKs for a variety of languages, including Java, JavaScript, Node.js, Go, C#/.NET, Perl, PHP, Python, and Ruby.
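DynamoDB's internal hash function is not public, but the idea of distributing items into partitions by hashing the partition key can be sketched in a few lines of Python (the key names and partition count below are invented for illustration, and MD5 merely stands in for the real hash function):

```python
import hashlib

def partition_for(partition_key: str, num_partitions: int) -> int:
    # Hash the partition key and map it onto one of N partitions.
    # MD5 stands in for DynamoDB's internal (undocumented) hash function.
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Items with different partition keys spread across three partitions.
placement = {key: partition_for(key, 3)
             for key in ["user#1", "user#2", "user#3", "user#4"]}
```

Because the mapping is deterministic, reads and writes for the same key always land on the same partition.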


Ques. 1): What are the main advantages of DynamoDB?


DynamoDB offers a number of advantages around availability, scalability, and durability. Its flexible, schema-less model also makes it easy to store semi-structured data that is awkward to fit into a fixed SQL schema.

Below are the major advantages of DynamoDB

Scalable: Virtually unlimited storage; users can store as much data as their application needs.

Cost Effective: Migrating a large part of a workload from SQL to NoSQL can cut costs. You pay for reading, writing, and storing data, along with any optional features you choose to enable in DynamoDB.


Data Replication: All data items are stored on SSDs and replication is managed internally across multiple availability zones in a region or can be made available across multiple regions.

Serverless: DynamoDB is a serverless database that scales horizontally by distributing a single table over several servers.

Easy Administration: Because Amazon DynamoDB is a fully managed service, you won't have to worry about hardware or software provisioning, setup and configuration, software patching, or partitioning data across several instances as you scale.

Secure: Customizable traffic filtering, Regulatory Compliance Automation, Database Threat Detection, and Advanced Notification and Reporting System


Ques. 2): What does the DynamoDBMapper class entail?


The DynamoDBMapper class provides access to Amazon DynamoDB. It gives you access to a DynamoDB API and lets you access your data across several tables. It also allows you to run queries and scans against tables, as well as perform different create, read, update, and delete (CRUD) activities on objects.


Ques. 3): What are the disadvantages of DynamoDB?


The disadvantages of DynamoDB are as follows:

  • Deployable only on AWS; it cannot be installed on individual desktops/servers
  • Queries: querying data is extremely limited
  • Table joins: joins are impossible
  • No triggers
  • No foreign-key concept to refer to other table items
  • No server-side scripts


Ques. 4): Is DynamoDB a SQL database?


DynamoDB is a NoSQL database that can manage structured and semi-structured data, as well as JSON documents. DynamoDB smoothly scales to handle massive volumes of data and a large number of users.

SQL is the industry standard for storing and retrieving data in relational databases. Relational databases offer a rich set of tools for building database-driven applications, but they all employ SQL.


Ques. 5): What distinguishes Amazon DynamoDB from Amazon Aurora?


Aurora is a relational database service, whereas DynamoDB is a NoSQL database service.

For data manipulation and retrieval, Aurora employs SQL, but DynamoDB uses a unique syntax.

Aurora scales reads through replicas, whereas DynamoDB scales by automatically sharding (horizontally partitioning) data across servers.

DynamoDB implements key-value and document models, while Aurora uses a relational database management system.

DynamoDB does not support server-side scripting, although Aurora does.


Ques. 6): What are the data types supported by DynamoDB?


DynamoDB supports a large set of data types for table attributes. Each data type falls into one of the three following categories -

  • Scalar - These types represent a single value, and include number, string, binary, Boolean, and null.
  • Document - These types represent a complex structure possessing nested attributes, and include lists and maps.
  • Set - These types represent multiple scalars, and include string sets, number sets, and binary sets.


Ques. 7): What is the DynamoDB Query functionality and how does it work?


In DynamoDB, you have two choices for accessing data from collections: Query and Scan. Scan searches the whole database for records that match the criteria, whereas Query performs a direct lookup for a specified data set based on key restrictions.

In addition to the primary key, DynamoDB supports global secondary indexes and local secondary indexes to speed up reads on non-key attributes.

As a result, it is faster and more efficient than the DynamoDB Scan function, and it is recommended for most data retrieval applications.
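The contrast can be sketched with a toy in-memory model (plain Python dicts standing in for a DynamoDB table; none of this is the actual AWS API):

```python
# A "table" keyed by partition key: Query is a direct key lookup,
# while Scan must examine every item and filter afterwards.
table = {
    "user#1": {"pk": "user#1", "city": "Oslo"},
    "user#2": {"pk": "user#2", "city": "Pune"},
    "user#3": {"pk": "user#3", "city": "Oslo"},
}

def query(pk):
    # Direct lookup on the key: touches a single item.
    return table.get(pk)

def scan(predicate):
    # Reads every item in the table, keeping those that match.
    return [item for item in table.values() if predicate(item)]

from_query = query("user#2")
from_scan = scan(lambda item: item["city"] == "Oslo")
```

The cost difference is the point: query touches one item regardless of table size, while scan's work grows with the number of items.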


Ques. 8): How does DynamoDB protect data from being lost?


DynamoDB has long-term storage and a two-tier backup strategy to keep data loss to a minimum. Each partition has three nodes, each of which holds a copy of the partition's data. In addition, a B-tree is used to locate data, and a replication log tracks changes on each node. DynamoDB takes snapshots of these and keeps them for a month in another AWS store in case data restoration is required.


Ques. 9): What are DynamoDB's secondary indexes?


A secondary index is a data structure that stores a portion of a table's properties, as well as an alternate key for Query operations. Query can be used to retrieve data from the index in the same way it can be used to retrieve data from a table.
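As a sketch, a secondary index can be pictured as a second lookup structure keyed by an alternate attribute and holding only the projected attributes (a toy Python model, not the AWS API):

```python
# Base table keyed by the primary key.
table = {
    "order#1": {"pk": "order#1", "status": "shipped", "total": 30},
    "order#2": {"pk": "order#2", "status": "open",    "total": 12},
    "order#3": {"pk": "order#3", "status": "shipped", "total": 99},
}

# A secondary index on "status", projecting only the key and the total.
index = {}
for item in table.values():
    index.setdefault(item["status"], []).append(
        {"pk": item["pk"], "total": item["total"]}
    )

# Query the index by the alternate key instead of scanning the table.
shipped = index.get("shipped", [])
```

The index entries contain only the projected attributes, which is why projections (covered in a later question) matter for index size and read cost.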


Ques. 10): What does Data Pipeline do?


Data Pipeline exports and imports data from an S3 bucket, a file, and so on. It also helps with backups, testing, and similar needs.

Data Pipeline uses two default IAM roles:

DataPipelineDefaultRole - contains all the actions you permit the pipeline to perform.

DataPipelineDefaultResourceRole - contains all the resources you permit the pipeline to use.


Ques. 11): List some important methods of the DynamoDBMapper class.


  • save: Saves a given object to the table.
  • load: Retrieves an item from the table.
  • delete: Deletes an item from the table.
  • query: Queries a table or an index.
  • scanPage: Scans a table or index and returns a single page of matching results.
  • parallelScan: Performs a parallel scan of the entire table or index.
  • batchSave: Saves objects to one or more tables.


Ques. 12): Is it safe to use DynamoDBMapper in a thread?


This class is thread-safe and can be shared across many threads. DynamoDBMapper will throw DynamoDBMappingException while using the save, load, and delete methods to indicate that domain classes are wrongly annotated or otherwise incompatible with this class.


Ques. 13): Are DynamoDB's write operations atomic?


DynamoDB supports atomic counters, which let you increment or decrement the value of an existing attribute via the update method without interfering with other write requests. For example, a counter attribute can be incremented by one each time the application runs.


Ques. 14): Does Amazon DynamoDB support conditional operations?


Yes. You can specify a condition that must be satisfied for an operation to be completed on an item.

You can define a ConditionExpression that can be constructed from the following:


Comparison operators: =, <>, <, >, <=, >=, BETWEEN, and IN

Logical operators: NOT, AND, and OR.

You can also construct a free-form conditional expression that combines multiple conditional clauses which also includes nested clauses.
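As a sketch, a conditional update request might look like the parameter dict below. The table, key, and attribute names are invented for illustration; the expression syntax combines the comparison and logical operators listed above:

```python
# Illustrative UpdateItem parameters: the write succeeds only if the
# item's price is between 10 and 20 AND the item is not archived.
params = {
    "TableName": "Products",
    "Key": {"ProductId": {"S": "p-1"}},
    "UpdateExpression": "SET discounted = :d",
    "ConditionExpression": "price BETWEEN :lo AND :hi AND NOT archived = :t",
    "ExpressionAttributeValues": {
        ":d":  {"BOOL": True},
        ":lo": {"N": "10"},
        ":hi": {"N": "20"},
        ":t":  {"BOOL": True},
    },
}
```

If the condition evaluates to false, DynamoDB rejects the write with a ConditionalCheckFailedException.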


Ques. 15): What are the differences between Amazon SimpleDB and Amazon DynamoDB?


Amazon DynamoDB: A fast and highly scalable NoSQL database service designed for internet-scale applications; it maintains predictably high performance and is highly cost-effective for workloads of any scale.

Amazon SimpleDB is an excellent fit for lower workloads that require query flexibility, but it has scaling limits.

It indexes all item attributes automatically and allows for query flexibility at the expense of performance and scale.


Ques. 16): In Amazon DynamoDB, how do you delete a Global Secondary Index?


The console or an API call can be used to delete a global secondary index.

On the console, choose the table from which you wish to delete the index, open the "Indexes" tab, select the index, and click "Delete".

The UpdateTable API call can also be used to delete a global secondary index.


Ques. 17): What is use of Scan operation in DynamoDB?


A Scan operation in Amazon DynamoDB reads every item in a table or a secondary index. By default, a Scan operation returns all of the data attributes for every item in the table or index. You can use the ProjectionExpression parameter so that Scan only returns some of the attributes, rather than all of them.
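A toy model of the projection idea (plain Python, not the AWS API): the scan still reads every item, but only the requested attributes come back:

```python
items = [
    {"id": 1, "name": "pen",  "price": 2, "stock": 40},
    {"id": 2, "name": "book", "price": 9, "stock": 12},
]

def scan(projection=None):
    # Without a projection, every attribute of every item is returned.
    if projection is None:
        return [dict(item) for item in items]
    # With a projection, each item is trimmed to the requested attributes.
    return [{k: item[k] for k in projection if k in item} for item in items]

projected = scan(projection=["name", "price"])
```

Note that a projection reduces the size of the response, not the amount of data the scan reads.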


Ques. 18): What type of query capabilities does DynamoDB provide?


GET/PUT operations in DynamoDB are supported by a user-defined primary key. The primary key is the sole attribute that objects in a database must have. When you construct a table, you specify the primary key, which uniquely identifies each item. DynamoDB also allows for flexible querying by allowing you to use global secondary indexes and local secondary indexes to query on nonprimary key properties.


Ques. 19): What are DynamoDB streams, and how do you use them?


DynamoDB Streams is a robust service that may be used in conjunction with other AWS services to solve a variety of challenges. When we enable DynamoDB Streams, it records a time-ordered sequence of item-level alterations in a DynamoDB table and saves the data for up to 24 hours.


Ques. 20): What are projections and how do they work?


The set of attributes that are copied, or projected, from a table into an index is known as the projection.

These are in addition to the index key and primary key attributes, which are projected automatically. When creating a local secondary index, you must define the attributes to be projected into the index. Each index includes a minimum of three attributes:

  • The table partition key value
  • The attribute to be used as the index sort key
  • The table sort key value


November 17, 2021

Top 20 Apache Spark Interview Questions & Answers


Ques: 1). What is Apache Spark?


Apache Spark is an open-source cluster-computing framework for real-time processing. It has a vibrant open-source community and is now the most active Apache project. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Spark is one of the Apache Software Foundation's most successful projects and has unquestionably established itself as the market leader in Big Data processing. Many enterprises run Spark on clusters with thousands of nodes, including large companies such as Amazon, eBay, and Yahoo!


Ques: 2). What advantages does Spark have over MapReduce?


Compared to MapReduce, Spark has the following advantages:

Spark processes data 10 to 100 times faster than Hadoop MapReduce thanks to in-memory processing, whereas MapReduce writes to persistent storage for each data processing step.

Unlike Hadoop, Spark has built-in libraries that let it perform a variety of tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, on the other hand, only supports batch processing.

Hadoop is heavily reliant on disk, while Spark encourages caching and storing data in memory. Spark can perform computations on the same dataset multiple times; this is called iterative computation, and Hadoop implements no iterative computing.


Ques: 3). What exactly is YARN?


YARN (Yet Another Resource Negotiator) is Hadoop's central resource-management platform, delivering scalable operations across the cluster. Like Mesos, YARN is a distributed container manager, while Spark is a data processing tool: Spark can run on YARN in the same way Hadoop MapReduce can. Running Spark on YARN requires a Spark binary distribution built with YARN support.


Ques: 4). What's the difference between an RDD, a Dataframe, and a Dataset?


Resilient Distributed Dataset (RDD) - The most basic data structure in Spark: an immutable collection of records partitioned across cluster nodes. It allows fault-tolerant in-memory calculations on massive clusters.

Unlike DataFrames and Datasets, an RDD does not keep a schema; it only stores data. If a user wants to apply a schema to an RDD, they must first build a case class and then apply the schema to the data.

We will use RDD for the below cases:

-When our data is unstructured, such as streams of text or media.

-When we don’t want to impose any schema.

-When we don’t care about column-name attributes while processing or accessing data.

-When we want to manipulate the data with functional programming constructs rather than domain-specific expressions.

-When we want low-level transformations, actions, and control over the dataset.


DataFrame:

-Like RDDs, DataFrames are immutable collections of data.

-Unlike RDDs, DataFrames carry a schema for their data, making it easy for users to access and process large datasets distributed among the nodes of a cluster.

-DataFrame provides a domain-specific-language API to manipulate distributed data, making Spark accessible to a wider audience beyond specialized data engineers.

-From Spark 2.x, a Spark DataFrame is nothing but Dataset[Row], or an alias for it (the untyped API): consider a DataFrame an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object.


A Dataset, by contrast, is a collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java. A case class can be applied to an RDD to use it as a Dataset[T].


Ques: 5). Can you explain how you can use Apache Spark along with Hadoop?


Apache Spark provides the benefit of being Hadoop compatible. They make a powerful tech team when they work together. Using Apache Spark and Hadoop combines the processing capability of Spark with the best capabilities of HDFS and YARN. The following are some examples of how to use Hadoop Components with Apache Spark:

Batch & Real-Time Processing – MapReduce and Spark can work together, where the former handles batch processing, and the latter handles real-time processing.

HDFS – Spark can make use of the HDFS to leverage the distributed replicated storage.

MapReduce – Apache Spark can be used in conjunction with MapReduce in a similar Hadoop cluster or independently as a processing framework.

YARN – You can run Spark applications on YARN.


Ques: 6). What is the meaning of the term "spark lineage"?


• In Spark, regardless of the actual data, all dependencies between RDDs will be logged in a graph. In Spark, this is referred to as a lineage graph.

• RDD stands for Resilient Distributed Dataset, with "resilient" referring to fault tolerance. Using the RDD lineage graph, we can re-compute a missing or damaged partition caused by node failure. When we generate new RDDs from existing RDDs, Spark uses the lineage graph to track the dependencies. Each RDD keeps a pointer to one or more parents, together with metadata describing its relationship to the parent RDD.

• The RDD lineage graph in Spark can be obtained using the toDebugString method.
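The recovery idea can be sketched in a few lines: each derived "RDD" records only its parent and the transformation that produced it, so lost data can be recomputed by replaying the lineage (a toy model, not Spark's actual classes):

```python
class ToyRDD:
    # Either holds materialized data (a base RDD) or remembers its
    # parent plus the transformation that produces it (the lineage).
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # No work happens here; we only extend the lineage graph.
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        if self.data is not None:
            return self.data                      # base data
        # Replay the parent's lineage, then apply our transformation.
        return [self.fn(x) for x in self.parent.compute()]

base = ToyRDD(data=[1, 2, 3])
derived = base.map(lambda x: x * 2).map(lambda x: x + 1)
result = derived.compute()   # recomputable at any time from the lineage
```

Because compute() can be replayed at any point, a lost partition never needs a stored replica, only its lineage.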


Ques: 7). List the various components of the Spark Ecosystem.


These are the five types of components in the Spark Ecosystem:

GraphX: Enables graphs and graph-parallel computation.

MLib: It is used for machine learning.

Spark Core: A powerful parallel and distributed processing platform.

Spark Streaming: Handles real-time streaming data.

Spark SQL: Combines Spark's functional programming API with relational processing.


Ques: 8). What is RDD in Spark? Write about it and explain it.


RDD stands for Resilient Distributed Dataset. RDDs are Spark's fault-tolerant core data structure and are immutable: they are partitioned datasets distributed among the cluster nodes.

Parallelizing a collection and referencing an external dataset are the two methods for constructing RDDs. RDDs are evaluated lazily, and this lazy evaluation is part of what makes Spark's processing fast.


Ques: 9). In Spark, how does streaming work?


Spark receives data in real time and divides it into batches. The Spark Engine processes these batches, and the final stream of results is returned in batches as well. DStream, or Discretized Stream, is the most basic stream unit in Spark.


Ques: 10). Is it feasible to access and analyse data stored in Cassandra databases using Apache Spark?


Yes, Apache Spark may be used to retrieve and analyse data stored in Cassandra databases, using the Spark Cassandra Connector. The connector allows Spark executors to communicate with local Cassandra nodes and request only local data.

Cassandra and Apache Spark can be connected to speed up queries by lowering network traffic between Spark executors and Cassandra nodes.


Ques: 11). What are the advantages of using Spark SQL?


Spark SQL carries out the following tasks:

It loads data from a variety of structured data sources, such as relational database management systems (RDBMS).

It can query data using SQL statements, both within a Spark program and from third-party tools such as Tableau via JDBC/ODBC connectors.

It also provides interoperability between SQL and Python/Scala code.


Ques: 12). What is the purpose of Spark Executor?


The Executors are obtained on top of worker nodes in the clusters when a SparkContext is formed. Spark Executors are in charge of performing computations and storing data on the worker node. They are also in charge of returning the results to the driver.


Ques: 13). What are the advantages and disadvantages of Spark?


Advantages: Spark is known for real-time data processing, which may be employed in applications such as stock market analysis, finance, and telecommunications.

Spark's stream processing allows for real-time data analysis, which can aid in fraud detection, system alarms, and other applications.

Due to its lazy evaluation mechanism and parallel processing, Spark processes data 10 to 100 times quicker.

Disadvantages: When compared to Hadoop, Spark consumes greater storage space.

The task is distributed over numerous clusters rather than taking place on a single node.

Spark's in-memory processing might be costly when dealing with large amounts of data.

When compared to Hadoop, Spark makes better use of data.


Ques: 14). What are some of the drawbacks of utilising Apache Spark?


The following are some of the drawbacks of utilising Apache Spark:

There is no file management system built-in. To take benefit of a file management system, integration with other platforms such as Hadoop is essential.

Higher latency and, as a result, lower throughput.

It does not support the processing of real-time data streams. In Apache Spark, live data streams are partitioned into batches, which are then processed and turned back into batches. To put it another way, Spark Streaming is more like micro-batch data processing than true real-time data processing.

There are fewer algorithms available.

Record-based window requirements are not supported by Spark streaming. It is necessary to distribute work across multiple clusters instead of running everything on a single node.

Apache Spark's in-memory ability becomes a bottleneck when used for the cost-efficient processing of big data.


Ques: 15). Is Apache Spark compatible with Apache Mesos?


Yes. Spark can work on Apache Mesos-managed clusters, just as it works on YARN-managed clusters. Spark may run without a resource manager in standalone mode. If it has to execute on multiple nodes, it can use YARN or Mesos.


Ques: 16). What are broadcast variables, and how do they work?


Accumulators and broadcast variables are the two types of shared variables in Spark. Instead of shipping back and forth to the driver, the broadcast variables are read-only variables cached in the Executors for local referencing. A broadcast variable preserves a read-only cached version of a variable on each computer instead of delivering a copy of the variable with tasks.

Additionally, broadcast variables are utilised to distribute a copy of a big input dataset to each node. To cut transmission costs, Apache Spark distributes broadcast variables using efficient broadcast algorithms.

There is no need to replicate variables for each task when using broadcast variables. As a result, data can be processed quickly. In contrast to RDD lookup(), broadcast variables assist in storing a lookup table inside the memory, enhancing retrieval efficiency.
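As a rough analogy in plain Python: a broadcast variable amounts to giving every task a reference to one shared, read-only lookup table rather than serializing a copy per task (illustrative only; in real Spark you would call SparkContext.broadcast):

```python
# One read-only lookup table, "shipped" once and shared by all tasks.
country_names = {"IN": "India", "NO": "Norway"}

def make_task(broadcast):
    # Each task closes over the same shared dict instead of carrying
    # its own serialized copy of it.
    def task(record):
        return broadcast.get(record["country"], "unknown")
    return task

task = make_task(country_names)
rows = [{"country": "IN"}, {"country": "NO"}, {"country": "BR"}]
resolved = [task(r) for r in rows]
```

This mirrors the lookup-table use case from the paragraph above: the table lives once per worker, and every task reads from it locally.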


Ques: 17). In Apache Spark, how does caching work?


Caching RDDs in Spark speeds up processing by allowing the same RDD to be accessed multiple times. In Spark Streaming, Discretized Streams (DStreams) likewise allow users to cache or persist data in memory.

The cache() and persist(level) functions cache data in memory and at the specified storage level, respectively.

persist() without a level specifier is the same as cache(): it caches the data in memory. The persist(level) method caches data at the given storage level, for example on disk, in memory, or in off-heap memory.


Ques: 18).  What exactly is Akka? What does Spark do with it?


Akka is a Scala and Java framework for building reactive, distributed, parallel, and resilient concurrent applications. Spark was historically built on top of Akka: when assigning tasks to worker nodes, Spark used Akka for job scheduling and for messaging between the master and the worker nodes. (More recent Spark versions have replaced Akka with Spark's own RPC implementation.)


Ques: 19). What applications do you utilise Spark streaming for?


When real-time data must be streamed into a Spark program, this method is employed. Data can be streamed from a variety of sources, such as Kafka, Flume, and Amazon Kinesis. The streamed data is divided into batches for processing.

Spark streaming is used to conduct real-time sentiment analysis of customers on social media sites like as Twitter and Facebook, among others.

Live streaming data processing is critical for detecting outages, detecting fraud in financial institutions, and making stock market predictions, among other things.


Ques: 20). What exactly do you mean when you say "lazy evaluation"?


Spark is intelligent about the way it works with data. When you ask Spark to perform an operation on a dataset, it records your instructions rather than executing them immediately, and does nothing until you demand a result. When map() is invoked on an RDD, the operation is not performed right away; Spark does not evaluate transformations until you trigger an action. This helps optimize the overall data processing workflow.
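Plain-Python generators show the same record-now, run-later behaviour, which may make the idea concrete (an analogy only, not Spark itself):

```python
log = []

def doubled(xs):
    # Like a transformation: nothing below runs until a result is demanded.
    for x in xs:
        log.append(x)       # side effect marks when work actually happens
        yield x * 2

pipeline = doubled(range(3))   # "transformation": no work done yet
untouched = list(log)          # still empty at this point
result = list(pipeline)        # "action": the work finally executes
```

Here list(pipeline) plays the role of a Spark action such as collect(): only then does the recorded work run.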

Top 20 Apache Hive Interview Questions & Answers

Ques: 1). What is Apache Hive, and how does it work?


Apache Hive is a sophisticated data warehouse project built on Hadoop. The platform focuses on data analysis and includes data query capabilities: Hive provides a SQL-like interface for querying data stored in files and database systems. Apache Hive is a popular data analysis and querying technology used by Fortune 500 companies around the world. When it is cumbersome or inefficient to express logic in HiveQL, Hive allows standard MapReduce programs to plug in custom mappers and reducers (user-defined functions, UDFs).



Ques: 2). What is the purpose of Hive?


Hive is a Hadoop tool that allows you to organise and query data in a database-like format, as well as write SQL-like queries. It can be used to access and analyse Hadoop data using SQL syntax.


Ques: 3). What are the differences between local and remote meta stores?


Local metastore: In the local metastore configuration, the metastore service runs in the same Java Virtual Machine (JVM) as the Hive service and connects to a database running in a separate JVM, either on the same machine or on a remote machine.

Remote metastore: In the remote metastore configuration, the metastore service and the Hive service run in separate JVMs. Other processes connect to the metastore server using Thrift network APIs. You can run several metastore servers in this configuration for high availability.


Ques: 4). Explain the core difference between the external and managed tables?


The following are the fundamental distinctions between managed and external tables:

When a managed table is dropped, the complete metadata and table data are lost. With an external table, Hive deletes only the metadata information associated with the table and leaves the table data in HDFS.

By default, when you create a table, Hive manages the data, meaning it moves the data into its warehouse directory. Alternatively, you can create an external table, which tells Hive to refer to data stored somewhere other than the warehouse directory.

The semantics of LOAD and DROP show the difference between the two table types: data loaded into a managed table is stored in Hive's warehouse directory.


Ques: 5). What is the difference between a read-only schema and a write-only schema?


A table's schema is enforced at data load time in a conventional database. The data being loaded is rejected if it does not conform to the schema. Because the data is validated against the schema when it is written into the database, this architecture is frequently referred to as schema on write.

Hive, on the other hand, verifies data when it is loaded, rather than when it is queried. This is referred to as schema on read.

Between the two approaches, there are trade-offs. Because the data does not have to be read, parsed, and serialized to disc in the database's internal format, schema on read allows for a very quick first load. A file copy or move is all that is required for the load procedure. It's also more adaptable: think of having two schemas for the same underlying data, depending on the analysis. (External tables can be used in Hive for this; see Managed Tables and External Tables.)

Because the database can index columns and compress the data, schema on write makes query time performance faster. However, it takes longer to load data into the database as a result of this trade-off. Furthermore, in many cases, the schema is unknown at load time, thus no indexes can be applied because the queries have not yet been formed. Hive really shines in these situations.
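The trade-off can be sketched in Python (a toy contrast with made-up row formats; nothing here is Hive's actual machinery):

```python
schema = {"id": int, "name": str}

def load_schema_on_write(rows, store):
    # Validate against the schema at load time; bad rows are rejected.
    for row in rows:
        if not all(isinstance(row.get(col), t) for col, t in schema.items()):
            raise ValueError(f"row rejected at load: {row!r}")
        store.append(row)

def load_schema_on_read(raw_lines, store):
    # Load is just a copy; nothing is parsed or checked yet.
    store.extend(raw_lines)

def query_schema_on_read(store):
    # Parsing and validation happen now, at query time.
    return [{"id": int(line.split(",")[0]), "name": line.split(",")[1]}
            for line in store]

raw = ["1,ada", "2,alan"]
hive_store = []
load_schema_on_read(raw, hive_store)        # fast "load": a plain copy
parsed = query_schema_on_read(hive_store)   # checked only when queried
```

The schema-on-read load is just a copy, which is why Hive loads are fast; the cost of parsing is paid on every query instead.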


Ques: 6). Write a query to insert a new column? Can you add a column with a default value in Hive?


ALTER TABLE test1 ADD COLUMNS (access_count1 int); You cannot add a column with a default value in Hive. Adding the column has no effect on the files backing your table; to deal with the "missing" data, Hive treats NULL as the value of that column for every existing row.

In Hive, you must effectively recreate the entire table, this time with the column filled in. It's possible that rerunning your original query with the additional column will be easier. Alternatively, you might add the column to the table you already have, then select all of its columns plus the new column's value.

Ques: 7). What is the purpose of Hive's DISTRIBUTED BY clause?


DISTRIBUTE BY determines how map output is divided among reducers. By default, MapReduce computes a hash of the keys output by mappers and uses the hash values to distribute the key-value pairs evenly among the available reducers. Suppose we want all of the rows for each value in a column to be processed together: we can use DISTRIBUTE BY to ensure that the records for each value reach the same reducer. DISTRIBUTE BY controls how reducers receive rows for processing, in the same way that GROUP BY does.

If the DISTRIBUTE BY and SORT BY clauses are in the same query, Hive expects the DISTRIBUTE BY clause to be before the SORT BY clause. When you have a memory-intensive job, DISTRIBUTE BY is a helpful workaround because it requires Hadoop to employ Reducers instead of having a Map-only job. Essentially, Mappers gather data depending on the DISTRIBUTE BY columns supplied, reducing the framework's overall workload, and then transmit these aggregates to Reducers.
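The routing rule can be sketched as hash-based bucketing (a toy model; real Hive/MapReduce uses its own partitioner):

```python
rows = [
    {"user": "a", "n": 1}, {"user": "b", "n": 2},
    {"user": "a", "n": 3}, {"user": "c", "n": 4},
]
num_reducers = 2
reducers = {i: [] for i in range(num_reducers)}

# DISTRIBUTE BY user: route each row by a hash of the user column,
# so all rows for one user land on the same reducer.
for row in rows:
    reducers[hash(row["user"]) % num_reducers].append(row)

# Which reducer(s) hold each user's rows (should be exactly one each).
homes = {u: {i for i, bucket in reducers.items()
             for r in bucket if r["user"] == u}
         for u in {"a", "b", "c"}}
```

The guarantee is only co-location, not order: rows for one user share a reducer, but nothing is sorted unless SORT BY is added.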


Ques: 8). What occurs when you perform a query in HIVE, please?


The Query Planner examines the query and turns it to a Hadoop Map Reduce job’s DAG (Directed Acyclic Graph).

The jobs are submitted to the Hadoop cluster in the order that the DAG suggests.

Only mappers are used for simple queries. The Input/Output format is in charge of splitting the input and reading data from HDFS. The data is then passed to a layer called SerDe (Serializer/Deserializer). In this case, the deserializer part of the SerDe converts the byte stream into a structured format.

Reducers will be included in Map Reduce jobs for aggregate queries. In this case, the serializer of the SerDe converts structured data to byte stream which gets handed over to the Input Output format which writes it to the HDFS.


Ques: 9). What is the importance of STREAM TABLE?


When you need information from several tables, joins are useful, but when you have 1.5 billion or more data in one table and want to link it to a master table, the order of the joining tables is crucial.

Consider the following scenario: 

select foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a; 

Because Hive streams the right-most table (bar) and buffers other tables (foo) in memory before executing map-side/reduce-side joins. As a result, if you buffer 1.5 billion or more records, your join query will fail since 1.5 billion records will very certainly fill up Java-Heap space exception. 

To overcome this limitation and free the user from having to order the joined tables by record size, Hive provides the hint /*+ STREAMTABLE(foo) */, which tells the Hive analyzer to stream table foo instead.

select /*+ STREAMTABLE(foo) */ foo.a,foo.b,bar.c from foo join bar on foo.a=bar.a;

This way, the user does not need to remember the join order of the tables.


Ques: 10). When is it appropriate to use SORT BY instead of ORDER BY?


When working with huge volumes of data in Apache Hive, we use SORT BY instead of ORDER BY. One reason is that SORT BY runs with multiple reducers, which cuts down the time it takes to complete the job. ORDER BY, on the other hand, uses a single reducer to enforce a total order, so the process takes longer than usual to complete.
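Side by side, with a hypothetical `sales` table:

```sql
-- ORDER BY: one total ordering, one reducer (slow on big data).
SELECT id, amount FROM sales ORDER BY amount DESC;

-- SORT BY: each reducer sorts only its own share of the rows,
-- so the output is sorted per reducer, but the job runs in parallel.
SELECT id, amount FROM sales SORT BY amount DESC;
```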


Ques: 11). What is the purpose of Hive's Partitioning function?


Partitioning allows users to arrange data in the Hive table in the way they want it. As a result, the system would be able to scan only the relevant data rather than the complete data set.

Consider the following scenario: Assume we have transaction log data from a business website for years such as 2018, 2019, 2020, and so on. So, in this case, you can utilise the partition key to find data for a specified year, say 2019, which will reduce data scanning by removing 2018 and 2020.
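The transaction-log scenario above can be sketched as follows (table and column names are illustrative):

```sql
-- Partition the transaction log by year; each year's rows land in
-- a separate HDFS directory (e.g. .../year=2019/).
CREATE TABLE transaction_log (
  txn_id  STRING,
  amount  DOUBLE
)
PARTITIONED BY (year INT);

-- A query that filters on the partition column scans only 2019's
-- files, skipping 2018 and 2020 entirely.
SELECT txn_id, amount FROM transaction_log WHERE year = 2019;
```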


Ques: 12). What is dynamic partitioning and how does it work?


In dynamic partitioning, the values of the partition columns are known only at runtime, i.e. when the data is loaded into the Hive table. A common use of dynamic partitioning:

To move data from a non-partitioned table to a partitioned table, which reduces latency and improves sampling.
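A minimal sketch of such a move, assuming hypothetical staging and target tables:

```sql
-- Enable dynamic partitioning (nonstrict mode lets every partition
-- value be determined at runtime).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition value is taken from the last column of the SELECT,
-- so each row is routed to its year=... partition as it is read.
INSERT OVERWRITE TABLE txn_partitioned PARTITION (year)
SELECT txn_id, amount, year
FROM txn_staging;
```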


Ques: 13). In hive, what's the difference between dynamic and static partitioning?


Hive partitioning is highly beneficial for pruning data during queries in order to reduce query times.

Partitions are produced when data is inserted into the table, so the choice depends on how the data is loaded. Static partitions are usually preferred when loading files (especially large files) into Hive tables; compared to dynamic partitioning, this saves time when loading data. You "statically" create a partition in the table and then move the file into that partition.

Because the files are large, they are typically created on HDFS. The partition column value can be taken from the filename, a date, and so on, without reading the entire large file. With dynamic partitioning, the entire file is read, i.e. every row of data, and an MR job partitions the data into the target tables based on specified fields in the file.

Dynamic partitions are typically handy when doing an ETL operation in your data pipeline. For example, suppose you load a large file into table X, then run an insert query into table Y that partitions the data on fields of X such as day and country. You might then run a further ETL step that takes the data in one country partition of table Y and writes it into a table Z partitioned by city for that country alone, and so on.

Thus depending on your end table or requirements for data and in what form data is produced at source you may choose static or dynamic partition.
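For contrast with the dynamic case, a static partition load looks like this (the HDFS path and table are hypothetical):

```sql
-- Static partition: the partition value is fixed in the statement,
-- so Hive moves the file into place without reading every row.
LOAD DATA INPATH '/landing/txn_2019.csv'
INTO TABLE transaction_log PARTITION (year = 2019);
```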


Ques: 14). What is ObjectInspector in Hive?


The ObjectInspector lets us analyze individual columns and the internal structure of a row object in Hive. It also provides a seamless way to access complex objects that may be stored in varied formats in memory, such as:

  • A standard Java object
  • An instance of the Java class
  • A lazily initialized object

The ObjectInspector lets the users know the structure of an object and also helps in accessing the internal fields of an object.


Ques: 15). How does impala outperform hive in terms of query response time?


Impala should be thought of as "SQL on HDFS," whereas Hive is more "SQL on Hadoop."

In other words, Impala does not use MapReduce at all. It simply runs daemons on all of the nodes that store HDFS data, allowing those daemons to return results rapidly without having to conduct a full MapReduce process.

The rationale for this is that running a Map/Reduce operation has some overhead, so short-circuiting Map/Reduce completely can result in a significant reduction in runtime.

That said, Impala is not a replacement for Hive; each is useful in different situations. Unlike Hive, Impala does not support fault tolerance, so if there is a problem during your query, the query is gone. Hive is the better choice for ETL processes where a single job failure would be costly, while Impala can be great for small ad-hoc queries, such as for data scientists or business analysts who just want to look at and study some data without having to develop substantial jobs.


Ques: 16). What are the different components of the Hive query processor?


Below is the list of Hive query processor components:

  • Metadata Layer (ql/metadata)
  • Parse and Semantic Analysis (ql/parse)
  • Map/Reduce Execution Engine (ql/exec)
  • Sessions (ql/session)
  • Type Interfaces (ql/typeinfo)
  • Tools (ql/tools)
  • Hive Function Framework (ql/udf)
  • Plan Components (ql/plan)
  • Optimizer (ql/optimizer)


Ques: 17). What is the difference between Hadoop Buffering and Hadoop Streaming?


Hadoop Streaming means implementing your map-reduce logic with custom-made Python or shell scripts (for example, via the Hive TRANSFORM keyword).

In this context, Hadoop buffering refers to the phase of a map-reduce job for a Hive query with a join in which records, sorted and grouped by the mappers, are read into the reducers. This is why you should order the join clauses in a Hive query so that the largest table comes last: it lets Hive implement the join more efficiently.
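A streaming sketch using TRANSFORM; the script name, its output columns, and the `raw_logs` table are all hypothetical:

```sql
-- Ship the script to the cluster, then stream each row through it.
ADD FILE /local/scripts/parse.py;

SELECT TRANSFORM (line)
USING 'python parse.py'
AS (user_id, url)
FROM raw_logs;
```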


Ques: 18). How does a map-side join optimise the work?


Suppose we have two tables, one of which is small. Before the original join MapReduce task, a local MapReduce job reads the small table's data from HDFS and puts it into an in-memory hash table, then serialises that in-memory hash table into a hash table file.

In the next stage, while the original join MapReduce job is running, the hash table file is moved to the Hadoop distributed cache, which populates the file onto each mapper's local disk. All mappers can then reload this persistent hash table file into memory and perform the join operations as before.

After optimisation, the small table only needs to be read once. In addition, if many mappers are running on the same machine, the distributed cache only needs to send a single copy of the hash table file to that machine.

Advantages of using Map-side join:

Using a map-side join reduces the cost of sorting and merging data in the shuffle and reduce stages, which in turn shortens the time it takes to complete the task.

Disadvantages of Map-side join:

It is only suitable when one of the tables being joined is small enough to fit into memory. As a result, performing a map-side join on tables that both hold a lot of data is not a good idea.
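A minimal sketch with hypothetical `fact` and `dim` tables:

```sql
-- Ask Hive to build the hash table from the small dimension table
-- and stream the large fact table through the mappers.
SELECT /*+ MAPJOIN(dim) */ fact.id, dim.name
FROM fact JOIN dim ON fact.dim_id = dim.id;

-- Recent Hive versions can convert eligible joins automatically:
SET hive.auto.convert.join = true;
```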


Ques: 19). What types of user-defined functions exist in Hive?


A UDF operates on a single row and produces a single row as its output. Most functions, such as mathematical functions and string functions, are of this type.

A UDF must satisfy the following two properties:

  • A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
  • A UDF must implement at least one evaluate() method.


A UDAF works on multiple input rows and creates a single output row. Aggregate functions such as COUNT and MAX are of this type.

A UDAF must satisfy the following properties:
  • A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF.
  • Its evaluator must implement five methods:
    • init()
    • iterate()
    • terminatePartial()
    • merge()
    • terminate()

  • A UDTF operates on a single row and produces multiple rows (a table) as output.
  • A UDTF must be a subclass of org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.
  • A custom UDTF is created by extending the GenericUDTF abstract class and implementing the initialize(), process(), and possibly close() methods.
  • Hive calls initialize() to notify the UDTF of the argument types to expect; the UDTF must then return an ObjectInspector corresponding to the row objects it will generate.
  • Once initialize() has been called, Hive gives rows to the UDTF via the process() method, inside which the UDTF can produce and forward rows to other operators by calling forward().
  • Finally, Hive calls the close() method when all the rows have been passed to the UDTF.
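Once written and compiled, a custom function is registered and called from HiveQL; the jar path and class name below are hypothetical:

```sql
-- Register a custom UDF from a jar, then call it like a built-in.
ADD JAR /local/jars/my-udfs.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpperUDF';

SELECT to_upper(name) FROM employees;
```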


Ques: 20). Is the HIVE LIMIT clause truly random?


Although the manual claims that it returns rows at random, this is not the case. Without any WHERE or ORDER BY clause, it returns "selected rows at random" only in the sense that they come back as they occur in the database. That does not mean they are truly random (or randomly picked); it only means the order in which the rows are returned cannot be predicted.

As soon as you add ORDER BY x DESC LIMIT 5, it returns the last five rows of whatever you are selecting from. To get truly random rows you would have to use something like ORDER BY rand() LIMIT 1.

However, if your indexes are not set up correctly, this can slow things down. A common alternative is to take the min/max of the IDs in the table, pick a random number between them, and then select those records (in this case, just one); that is usually faster than letting the database do the work, especially on a huge dataset.
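The truly-random form, spelled out against a hypothetical table:

```sql
-- rand() assigns each row a pseudo-random sort key, so the row that
-- survives LIMIT 1 is effectively random. Note this forces a full
-- sort, which is expensive on large tables.
SELECT * FROM my_table
ORDER BY rand()
LIMIT 1;
```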

December 27, 2019

Top 20 SQL Server Interview Questions and Answers

Ques: 1. What is SQL server agent?


The SQL Server Agent plays a vital role in the day-to-day tasks of a SQL Server administrator (DBA). Its purpose is to run tasks easily with the scheduler engine, which allows jobs to run at a scheduled date and time.

Oracle Fusion Applications interview Questions and Answers

Ques: 2. What is a Trigger?


Triggers are used to execute a batch of SQL code when an INSERT, UPDATE, or DELETE command is executed against a table. They fire automatically when the data is modified.
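A minimal AFTER INSERT trigger as a sketch; the `Employees` and `EmployeeAudit` tables are hypothetical:

```sql
-- Log every newly inserted employee into an audit table.
CREATE TRIGGER trg_EmployeeInsert
ON Employees
AFTER INSERT
AS
BEGIN
    -- 'inserted' is the pseudo-table holding the new rows
    INSERT INTO EmployeeAudit (EmpId, ChangedAt)
    SELECT EmpId, GETDATE() FROM inserted;
END;
```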

Oracle Accounts Payables Interview Questions and Answers


Ques: 3. What is the use of SET NOCOUNT ON/OFF statement?


By default, NOCOUNT is set to OFF, so SQL Server returns the number of records affected whenever a command is executed. If the user does not want the row count displayed, it can be explicitly set to ON (SET NOCOUNT ON).
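A typical use inside a stored procedure; the procedure and table below are hypothetical:

```sql
-- Suppress the "(n rows affected)" messages inside the procedure,
-- which reduces network chatter when many statements run.
CREATE PROCEDURE dbo.UpdatePrices
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE Products SET Price = Price * 1.1 WHERE Discontinued = 0;
END;
```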

Oracle ADF Interview Questions and Answers


Ques: 4. What is SQL injection?


SQL injection is an attack in which a malicious user inserts malicious code into strings that are passed to an instance of SQL Server for parsing and execution. Because the server executes any syntactically valid query it receives, all statements that incorporate user input have to be checked for vulnerabilities.
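The standard defence is parameterisation; `Users` and `@input` below are hypothetical:

```sql
-- Pretend @input holds untrusted user input.
DECLARE @input NVARCHAR(50) = N'Alice';

-- Vulnerable pattern (do NOT do this): input concatenated into SQL.
-- EXEC ('SELECT * FROM Users WHERE Name = ''' + @input + '''');

-- Safer: pass the value as a parameter, so it is never parsed as SQL.
EXEC sp_executesql
     N'SELECT * FROM Users WHERE Name = @name',
     N'@name NVARCHAR(50)',
     @name = @input;
```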

Oracle Access Manager Interview Questions and Answers


Ques: 5. What will be the maximum number of indexes per table?


In SQL Server 2008, a table can have up to 1,000 indexes in total: 1 clustered index and 999 non-clustered indexes.

Oracle Fusion HCM Interview Questions and Answers

Ques: 6. What is Filtered Index?


A filtered index indexes only a portion of the rows in a table, which improves query performance, reduces index maintenance, and lowers index storage costs. An index created with a WHERE clause is called a filtered index.
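A sketch against a hypothetical `Orders` table:

```sql
-- Index only the open orders; the index is smaller and cheaper
-- to maintain than one covering every row.
CREATE NONCLUSTERED INDEX IX_Orders_Open
ON Orders (OrderDate)
WHERE Status = 'Open';
```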


Oracle SCM Interview Questions and Answers

Ques: 7. List the different index configurations possible for a table?


A table can have one of the following index configurations:

  • No indexes 
  • A clustered index 
  • A clustered index and many non-clustered indexes 
  • A non-clustered index 
  • Many non-clustered indexes

Oracle Financials Interview questions and Answers

Ques: 8. What is sub query and its properties?


A subquery is a query nested inside a main query such as a SELECT, UPDATE, INSERT, or DELETE statement, and can be used wherever an expression is allowed. The properties of a subquery are:

  • A subquery should not have an ORDER BY clause.
  • A subquery should be placed on the right-hand side of the comparison operator of the main query.
  • A subquery should be enclosed in parentheses, because it needs to be executed before the main query.
  • More than one subquery can be included.
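Those properties can be seen in a small example (the `Employees` table is hypothetical):

```sql
-- Subquery in parentheses, on the right-hand side of the comparison:
-- find employees earning more than the company average.
SELECT Name, Salary
FROM Employees
WHERE Salary > (SELECT AVG(Salary) FROM Employees);
```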

Oracle Cloud Interview Questions and Answers

Ques: 9. What is Mirroring?


Mirroring is a high-availability solution. It is designed to maintain a hot standby server that is transactionally consistent with the principal server. Transaction log records are sent directly from the principal server to a secondary server, which keeps the secondary up to date with the principal.


Oracle PL/SQL Interview Questions and Answers

Ques: 10. What is an execution plan?


An execution plan is a graphical or textual display of how SQL Server breaks a query down to produce the required result. It helps a user determine why a query is taking a long time to execute, and based on that investigation the user can rewrite the query for better performance.

Query Analyzer has an option called "Show Execution Plan" (located on the Query drop-down menu). If this option is turned on, the query execution plan is displayed in a separate window whenever a query is run.
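The textual plan can also be requested with a session setting:

```sql
-- Return the estimated plan as text instead of running the query.
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM Orders WHERE OrderDate > '2019-01-01';  -- hypothetical table
GO
SET SHOWPLAN_TEXT OFF;
GO
```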


Oracle SQL Interview Questions and Answers

Ques: 11. What is a performance monitor?


Windows performance monitor is a tool to capture metrics for the entire server. We can use this tool for capturing events of the SQL server also.

Some useful counters are – Disks, Memory, Processors, Network, etc.


Oracle RDMS Interview Questions and Answers

Ques: 12. What is the difference between a Local and a Global temporary table?


In SQL Server, a local temporary table (prefixed with #) is visible only to the connection that created it and is dropped when that connection closes. A global temporary table (prefixed with ##) is visible to all connections, and is dropped once the creating connection closes and no other connection is still referencing it.
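The two forms differ only in their prefix:

```sql
-- Local temp table: one '#', visible only to this session.
CREATE TABLE #SessionTotals (Id INT, Total MONEY);

-- Global temp table: two '##', visible to every session until the
-- creating session ends and no one else is still using it.
CREATE TABLE ##SharedTotals (Id INT, Total MONEY);
```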


BI Publisher Interview Questions and Answers

Ques: 13. What is the SQL Profiler?


SQL Profiler provides a graphical representation of events in an instance of SQL Server for monitoring and troubleshooting purposes. We can capture and save the data for further analysis, and apply filters to capture only the specific data we want.


Oracle 10g Interview Questions and Answers

Ques: 14. What are the properties of the Relational tables?


Relational tables have six properties:

  1. Values are atomic. 
  2. Column values are of the same kind. 
  3. Each row is unique. 
  4. The sequence of columns is insignificant. 
  5. The sequence of rows is insignificant. 
  6. Each column must have a unique name.

BlockChain interview Questions and Answers

Ques: 15. What is View?


A view is a virtual table that contains data from one or more tables. Views restrict access to the underlying table by selecting only the required values, and they make complex queries easier.

Rows updated or deleted through the view are updated or deleted in the table the view was created from. Note also that as data in the original table changes, so does the data in the view, since a view is simply a way of looking at part of the original table. The results of using a view are not permanently stored in the database.
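A sketch of a view hiding sensitive columns; the table and columns are hypothetical:

```sql
-- Expose only the non-sensitive columns of a table.
CREATE VIEW dbo.vw_EmployeePublic
AS
SELECT EmpId, Name, Department   -- salary etc. stay hidden
FROM dbo.Employees;

-- Query the view like a table.
SELECT * FROM dbo.vw_EmployeePublic WHERE Department = 'Sales';
```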


MySQL Interview Questions and Answers

Ques: 16. Why is replication required on the SQL Server?


Replication is the mechanism used to synchronize data among multiple servers.

It is mainly used to increase read capacity and to give users a choice of servers on which to perform their read/write operations.


Azure Interview Questions and Answers

Ques: 17. What part does database design have to play in the performance of a SQL Server-based application?


It plays a very major part. When building a new system, or adding to an existing one, it is crucial that the design is correct. Ensuring that the right data is captured and placed in the appropriate tables, that the right relationships exist between the tables, and that data redundancy is eliminated is the ultimate goal when considering performance. Planning a design should be an iterative process, constantly reviewed as the application is developed. It is rare, although it should be the aim everyone strives for, that the initial design and system goals survive unaltered, however slightly. A designer therefore has to stay on top of this and ensure that the design of the database remains efficient.


Ques: 18. What command is used to create a database in the SQL Server and how?


The CREATE DATABASE command is used to create a database in SQL Server. Following is the way to use this command:

CREATE DATABASE database_name

Example: if the name of the database is "employee", the command to create it can be written as CREATE DATABASE employee.


Ques: 19. What is an extended stored procedure? Can you instantiate a COM object by using T-SQL?


An extended stored procedure is a function within a DLL (written in a programming language like C, C++ using Open Data Services (ODS) API) that can be called from T-SQL, just the way we call normal stored procedures using the EXEC statement.

Yes, you can instantiate a COM (written in languages like VB, VC++) object from T-SQL by using sp_OACreate stored procedure. 


Ques: 20. When should SQL Server-based cursors be used, and not be used?


SQL Server cursors are perfect when you want to work one record at a time, rather than taking all the data from a table as a single bulk operation. However, they should be used with care, as they can hurt performance, especially as the volume of data grows.

From a beginner's viewpoint, cursors should be avoided whenever possible, because if they are badly written, or deal with too much data, they really will impact a system's performance. There will be times when cursors cannot be avoided, and few systems exist without them. If you do need one, try to reduce the number of records to process by building a temporary table first, and then driving the cursor from that. The lower the number of records to process, the faster the cursor will finish. Always try to think "out of the envelope".
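The basic cursor pattern looks like this; `Orders` and its columns are hypothetical:

```sql
-- Basic cursor loop. Prefer set-based statements where possible,
-- and keep the cursor's row count small (e.g. via a temp table).
DECLARE @id INT;
DECLARE order_cursor CURSOR FOR
    SELECT OrderId FROM Orders WHERE Status = 'Pending';

OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @id;                          -- per-row work goes here
    FETCH NEXT FROM order_cursor INTO @id;
END;
CLOSE order_cursor;
DEALLOCATE order_cursor;
```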


More Interview Questions and Answers:


C language Interview Questions and Answers


C++ language Interview Questions and Answers


Machine Learning Interview Questions and Answers


PowerShell Interview Questions and Answers


Python Interview Questions and Answers


Python Pandas Interview Questions and Answers


SQL Server Interview Questions and Answers


Unix interview Questions and Answers


C# Language Interview Questions and Answers


CSS (Cascading Style Sheets ) Interview Questions and Answers


Robotic Process Automation(RPA) Interview Questions and Answers


UX Design Interview Questions and Answers


Docker Interview Questions and Answers


Google Cloud Computing Interview Questions and Answers


Linux Interview Questions and Answers


Data Science Interview Questions and Answers


Edge Computing Interview Questions and Answers


Hadoop Technical Interview Questions and Answers


Hyperion Technical Interview Questions and Answers


Internet of Things (IOT) Interview Questions and Answers

