Apache Drill is an open source
software framework that supports data-intensive distributed applications for
interactive analysis of large-scale datasets. Drill is the open source version of
Google's Dremel system, which is available as an infrastructure service called
Google BigQuery. It supports a variety of NoSQL databases and file systems,
including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Google Cloud
Storage, Swift, NAS, and local files. A single query can join data from
multiple datastores. For example, you can join a user profile collection in
MongoDB with a directory of event logs in Hadoop.
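As a rough sketch of what such a cross-datastore query might look like, the Python snippet below only assembles the SQL text. The storage plugin names (`mongo`, `dfs`), the collection `app.users`, the log directory `/logs/events`, and every column name are assumptions made up for this example, not values from a real deployment.

```python
def build_cross_store_join(mongo_table: str, log_dir: str, limit: int = 10) -> str:
    """Return Drill SQL joining a MongoDB collection with a directory of log files.

    Column names (user_id, name, event_type, event_time) are hypothetical.
    """
    return (
        "SELECT u.user_id, e.event_type, e.event_time\n"
        f"FROM {mongo_table} AS u\n"
        f"JOIN dfs.`{log_dir}` AS e\n"
        "  ON u.user_id = e.user_id\n"
        f"LIMIT {limit}"
    )


# Hypothetical plugin, database, and path names.
sql = build_cross_store_join("mongo.app.`users`", "/logs/events")
print(sql)
```

The point of the sketch is that both sources appear in one FROM/JOIN clause, each addressed through its own storage plugin prefix.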
Ques. 1): What is Apache Drill,
and how does it work?
Answer:
Apache Drill is an open-source,
schema-free SQL engine used to process large datasets and the semi-structured
data generated by modern Big Data applications. One of its best features is
plug-and-play integration with existing Hive and HBase deployments. Drill was
inspired by Google's Dremel system. It lets us get to insights faster because
we do not have to worry about schema creation, data loading, or the other kinds
of maintenance that an RDBMS requires. With Drill we can easily analyse
multi-structured data.
Apache Drill is a schema-free SQL
Query Engine for Hadoop, NoSQL, and Cloud Storage that allows us to explore,
visualise, and query various datasets without needing to use ETL or other
methods to fix them to a schema.
Apache Drill can also directly
analyse multi-structured and nested data in non-relational data stores, without
any data restrictions.
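To make the idea of querying nested data concrete, here is a small sketch. The SQL string uses Drill's FLATTEN function on a repeated field, and the Python function only emulates what FLATTEN conceptually produces (one output row per array element). The file path, the record, and the field names are all invented for illustration.

```python
import json

# Hypothetical nested record, as it might appear in a JSON file that Drill reads directly.
record = json.loads(
    '{"customer": "ann", "orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 20.0}]}'
)

# Drill SQL that addresses the nested data without a predefined schema.
# The path /data/customers.json is an assumption for this example.
DRILL_SQL = (
    "SELECT t.customer, FLATTEN(t.orders) AS o "
    "FROM dfs.`/data/customers.json` AS t"
)


def flatten_orders(rec: dict) -> list:
    """Emulate FLATTEN: emit one row per element of the repeated 'orders' field."""
    return [{"customer": rec["customer"], "o": order} for order in rec["orders"]]


rows = flatten_orders(record)
for row in rows:
    print(row)
```

This is only a conceptual emulation in plain Python; Drill itself performs the operation inside its distributed execution engine.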
Apache Drill is described as the first distributed SQL query engine with a
schema-free JSON document model, similar to that of:
- Elasticsearch
- MongoDB
- other NoSQL databases
Apache Drill is very useful
for professionals who already work with SQL databases and BI tools
such as Pentaho, Tableau, and QlikView.
Apache Drill also supports:
- RESTful APIs
- ANSI SQL
- JDBC/ODBC drivers
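As an illustration of the REST interface, the sketch below builds (but does not send) an HTTP request for submitting a SQL query. It assumes a Drill instance listening on `http://localhost:8047` and the `/query.json` endpoint with a `queryType`/`query` JSON body; check both against the Drill version you run. The sample table `cp.`employee.json`` is the demo dataset bundled on Drill's classpath.

```python
import json
import urllib.request


def build_drill_query_request(
    sql: str, base_url: str = "http://localhost:8047"
) -> urllib.request.Request:
    """Build a POST request that submits a SQL query to Drill's REST API."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_drill_query_request("SELECT * FROM cp.`employee.json` LIMIT 3")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To run it against a live Drill instance (not executed here):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

Building the request separately from sending it keeps the example runnable without a server, and makes the payload easy to inspect.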
Ques. 2): Is Drill a Good
Replacement for Hive?
Answer:
Hive is a batch processing
framework that is best suited for processes that take a long time to complete.
Drill outperforms Hive when it comes to data exploration and business
intelligence.
Drill is also not limited to
Hadoop. It can, for example, query NoSQL databases (such as MongoDB and HBase)
and cloud storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob Storage,
Swift).
Both Hive and Drill are used to query very large datasets. Hive is best for long-running batch processing, whereas Drill is the better option for interactive work and offers a better user experience. Drill is also not tied to Hadoop; it can access and process data from other sources.
Ques. 3): What are the
differences between Apache Drill and Druid?
Answer:
The primary distinction is that
Druid pre-aggregates metrics to provide low-latency queries and low storage
use.
Because of this, you cannot keep information about
individual events when you use Druid to analyse event data.
Drill is a generic abstraction
over a variety of NoSQL data stores. Because the values in these data stores are
not pre-aggregated and are stored individually, they can be used for purposes
other than storing aggregated metrics.
Drill does not provide the low-latency
queries required to build dynamic reporting dashboards.
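The trade-off described above can be illustrated with a toy example. This is a simplified sketch of pre-aggregation (roll-up) in general, not Druid's actual implementation; the events and field layout are invented.

```python
from collections import defaultdict

# Hypothetical raw events: (minute, page, user)
events = [
    ("12:00", "/home", "u1"),
    ("12:00", "/home", "u2"),
    ("12:00", "/cart", "u1"),
    ("12:01", "/home", "u3"),
]


def roll_up(raw):
    """Pre-aggregate: keep only a count per (minute, page), dropping per-event detail."""
    counts = defaultdict(int)
    for minute, page, _user in raw:
        counts[(minute, page)] += 1
    return dict(counts)


rolled = roll_up(events)
print(rolled)
print(len(events), "raw rows vs", len(rolled), "rolled-up rows")
```

After the roll-up, the counts are cheap to store and query, but the question "which users visited /home at 12:00?" can no longer be answered. A store that keeps the raw rows, which is what Drill queries, can still answer it.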
Ques. 4): What does Tajo have in
common with Apache Drill?
Answer:
Tajo looks similar to Drill at first
glance. They do, however, have a lot of differences. The most significant
are their origins and ultimate goals. Drill is based on
Google's Dremel, whereas Tajo is based on a combination of MapReduce and parallel
RDBMS techniques. Tajo aims to be a relational, distributed data warehousing system,
whereas Drill aims to be a distributed system for interactive analysis of
large-scale datasets.
As far as I'm aware,
Drill has the following characteristics:
- Drill is a Google Dremel clone project.
- Its primary goal is to do aggregate queries using a full table scan.
- Its main goal is to handle queries quickly.
- It employs a hierarchical data model.
Tajo, on the other hand, has the
following features:
- Tajo combines the benefits of MapReduce and Parallel databases.
- It primarily targets complex data warehouse queries and has its own distributed query evaluation approach.
- Its major goal is scalable processing by exploiting the advantages of MapReduce and Parallel databases.
- We expect that sophisticated query optimization techniques, intermediate data streaming, and online aggregation will significantly reduce query response time.
- It utilizes a relational data model. We feel that the relational data model is sufficient for modelling the vast majority of real-world applications.
- Tajo is expected to be linked with existing BI and OLAP software.
Ques. 5): What are the benefits
of using Apache Drill?
Answer:
Some of the most compelling
reasons to use Apache Drill are listed below.
- Getting started is simple: untar the Apache Drill distribution and run it in local mode. No infrastructure setup or schema design is required.
- Running SQL queries does not necessitate the use of a schema.
- We can query semi-structured and complex data in real time with Drill.
- The SQL:2003 syntax standard is supported by Apache Drill.
- Drill can be readily linked with BI products like QlikView, Tableau, and MicroStrategy to give analytical capabilities.
- We can use Drill to conduct an interactive query that will access the Hive and HBase tables.
- Drill supports multiple data stores such as local file systems, distributed file systems, Hadoop HDFS, Amazon S3, Hive tables, HBase tables, and so on.
- Apache Drill scales easily from a single machine up to 1000 nodes.
Ques. 6): What Are the Great
Features of Apache Drill?
Answer:
Key features include:
- Schema-free JSON document model similar to MongoDB and Elasticsearch
- Code reusability
- Easy to use and developer friendly
- High performance Java based API
- Memory management system
- Industry-standard APIs such as ANSI SQL, ODBC/JDBC, and RESTful APIs
How does Drill achieve performance?
- Distributed query optimization and execution
- Columnar Execution
- Optimistic Execution
- Pipelined Execution
- Runtime compilation and code generation
- Vectorization
Ques. 7): What are some of the
things we can do with the Apache Drill Web interface?
Answer:
The tasks that we can perform
through the Apache Drill Web interface are listed below.
- SQL queries can be run from the Query tab.
- Running queries can be stopped and re-run.
- Executed queries can be reviewed through their query profiles.
- In the Storage tab, we can view the storage plugins.
- In the Logs tab, we can see logs and stats.
Ques. 8): What is Apache Drill's
performance like? Does the number of lines in a query result affect its
performance?
Answer:
We used Drill for its REST
server, connected to a D3 visualisation, to query IoT data. The queries
(SELECT and JOIN) were very slow, and this was only resolved
when we switched to Spark SQL.
Drill is useful in that it can
query most data sources, but it should be tested before being used in
production. (If you want something faster, I believe you can find a better
query engine.) For development and testing, though, it has been quite useful.
Ques. 9): What Data Storage
Plugins does Apache Drill support?
Answer:
The following is a list of Data
Storage Plugins that Apache Drill supports.
- File System Data Source Storage Plugin
- HBase Data Source Storage Plugin
- Hive Data Source Storage Plugin
- MongoDB Data Source Storage Plugin
- RDBMS Data Source Storage Plugin
- Amazon S3 Data Source Storage Plugin
- Kafka Data Source Storage Plugin
- Azure Blob Data Source Storage Plugin
- HTTP Data Source Storage Plugin
- Elastic Search Data Source Storage Plugin
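To give a feel for what a storage plugin definition looks like, the sketch below assembles a JSON payload for a MongoDB plugin. The `type`, `connection`, and `enabled` fields follow the layout I understand Drill's MongoDB plugin to use, and the name/config wrapper is what the REST endpoint for storage plugins is assumed to expect; treat both as assumptions and verify them against your Drill version. The connection string is a placeholder.

```python
import json


def mongo_plugin_payload(name: str, connection: str) -> dict:
    """Build a storage plugin definition for a MongoDB data source."""
    return {
        "name": name,
        "config": {
            "type": "mongo",           # plugin type identifier (assumed)
            "connection": connection,  # MongoDB connection string (placeholder)
            "enabled": True,
        },
    }


payload = mongo_plugin_payload("mongo", "mongodb://localhost:27017/")
print(json.dumps(payload, indent=2))
```

The same pattern applies to the other plugins in the list: each has its own `type` and a set of type-specific connection settings.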
Ques. 10): What's the difference
between Apache Solr and Apache Drill, and how do you use them?
Answer:
The distinction between Apache
Solr and Apache Drill is comparable to that between a spoon and a knife. In other
words, despite the fact that they deal with comparable issues, they are
fundamentally different instruments.
To put it plainly... Apache Solr
is a search platform, while Apache Drill is a platform for interactive data
analysis (not restricted to just Hadoop). Before performing searches with Solr,
you must parse and index the data into the system. For Drill, the data is
stored in its raw form (i.e., unprocessed) on a distributed system (e.g.,
Hadoop), and the Drill application instances (i.e., drillbits) will process it
in parallel.
Ques. 11): What is the
recommended performance tuning approach for Apache Drill?
Answer:
To tune Apache Drill's
performance, a user must first understand the data, the query plan, and the data
source. Once these are understood, the user can apply the
tuning steps below to improve query performance.
- Change the query planning options if necessary.
- Change the broadcast join options as needed.
- Switch aggregation between one and two phases.
- Enable or disable the hash-based memory-constrained operators.
- Enable query queuing if your workload needs it.
- Control parallelization.
- Use partitions to organise your data.
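Several of the steps above are controlled through Drill options set with ALTER SESSION or ALTER SYSTEM. The sketch below generates such statements from a dictionary. The option names are given as I recall them from Drill's documentation, and the values are arbitrary placeholders rather than recommendations; confirm both against the `sys.options` table of your installation.

```python
# Option names as recalled from Drill's docs; values are placeholders, not advice.
SESSION_OPTIONS = {
    "planner.enable_broadcast_join": True,   # broadcast join on/off
    "planner.broadcast_threshold": 10_000_000,
    "planner.enable_multiphase_agg": True,   # one-phase vs two-phase aggregation
    "planner.enable_hashagg": True,          # hash-based operators
    "planner.enable_hashjoin": True,
    "planner.width.max_per_node": 4,         # parallelization per node
}


def to_alter_statements(options: dict, scope: str = "SESSION") -> list:
    """Render each option as an ALTER SESSION/SYSTEM SET statement."""
    if scope not in ("SESSION", "SYSTEM"):
        raise ValueError("scope must be SESSION or SYSTEM")
    statements = []
    for name, value in options.items():
        # SQL booleans are lowercase true/false; numbers are rendered as-is.
        literal = str(value).lower() if isinstance(value, bool) else str(value)
        statements.append(f"ALTER {scope} SET `{name}` = {literal}")
    return statements


statements = to_alter_statements(SESSION_OPTIONS)
# Query queuing is assumed here to be a system-level setting.
statements += to_alter_statements({"exec.queue.enable": True}, scope="SYSTEM")
for stmt in statements:
    print(stmt)
```

Changing one option at a time and comparing query profiles before and after is usually easier to reason about than applying many changes together.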
Ques. 12): What should you do if
an Apache Drill query takes a long time to deliver a result?
Answer:
Check the following points if an
Apache Drill query is taking too long to return a result.
- Check the query profile to see whether the query is progressing. Progress is indicated by the times of the last update and last change.
- Identify the step where Apache Drill spends the most time and optimise it.
- Check whether partition pruning and projection pushdown are being applied.
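One way to follow these checks is to look at the plan and the profile of the slow query. The helper functions below are a minimal sketch: they build an `EXPLAIN PLAN FOR` statement and the URL of a query profile on the Drill Web server. The base URL, the query ID, the directory `/logs/events`, and the column `event_type` are hypothetical; `dir0` refers to the first subdirectory level that Drill exposes when querying a directory tree.

```python
def explain_statement(sql: str) -> str:
    """Wrap a query in EXPLAIN PLAN FOR so the plan can be inspected."""
    return "EXPLAIN PLAN FOR " + sql.strip().rstrip(";")


def profile_url(query_id: str, base_url: str = "http://localhost:8047") -> str:
    """Return the assumed REST URL of a query's profile."""
    return f"{base_url}/profiles/{query_id}.json"


# Naming columns instead of SELECT * gives projection pushdown a chance to apply;
# filtering on a directory column gives partition pruning a chance to apply.
slow_query = "SELECT event_type FROM dfs.`/logs/events` WHERE dir0 = '2024';"

print(explain_statement(slow_query))
print(profile_url("hypothetical-query-id"))
```

In the resulting plan, you would look at how many files the scan operator reads and which columns it requests to judge whether pruning and pushdown took effect.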
Ques. 13): I'm using Apache Drill
with one drillbit to query approximately 20 GB of data, and each query takes
several minutes to complete. Is this normal?
Answer:
The performance of a single
drillbit depends on the Java memory settings and the resources available on the machine
where the query runs. A WHERE clause requires more work from the query engine,
because it must find the matching rows,
which is why such queries are slower.
You can also change the JVM parameters
in the Drill configuration. Devoting more resources to your queries
should produce faster results.
Ques. 14): How does Apache Drill
compare to Apache Phoenix with Hbase in terms of performance?
Answer:
This is an interesting question. Drill is a distributed
query engine. Phoenix, in contrast, implements
RDBMS semantics in order to compete with other RDBMSs. That isn't to suggest
that Drill won't support inserts and other features... But, because they don't
do the same thing right now, comparing their performance isn't really
apples-to-apples.
Drill can query HBase and even
push query filters down into the database. Additionally, there is presently
a branch of Drill that can query data stored in Phoenix.
Drill can query
multiple data sources at once. So if you choose to use Phoenix, you could use
both to satisfy your business needs.
Ques. 15): Is Apache Drill 1.5
ready for usage in production?
Answer:
Drill is one of the most mature
SQL-on-Hadoop solutions in general. As with all of the SQL-on-Hadoop solutions,
it may or may not be the best fit for your use case. I mention that solely
because I've heard of some extremely far-fetched use cases for Drill that
aren't a good fit.
Drill will serve you well in your
production environment if you wish to run SQL queries without
"requiring" ETL first.
Any tool that supports the ODBC
and JDBC connections can easily access it as well.
Ques. 16): Why doesn't Apache
Drill get the same amount of attention as other SQL-on-Hadoop tools?
Answer:
My job involves keeping track of SQL-on-Hadoop
tools and advising enterprise customers on which ones would be ideal for them.
Many SQL-on-Hadoop solutions have a large number of users. Presto has been
used by a number of major Internet firms (Netflix, Airbnb), as well as a number
of large corporations. It is largely sponsored by Facebook and Teradata (my
employer). The Cloudera distribution makes Impala widely available. Phoenix and
Kylin also come up a lot and are quite popular. Spark SQL is the go-to for new
projects these days, until it doesn't work or a flaw is discovered. Hive is the
hard-to-beat incumbent. Adoption is crucial.
Ques. 17): Is it possible to
utilise Apache Drill + MongoDB in the same way that RDBMS is used?
Answer:
To begin, you need to understand what
NoSQL is for. To be honest, a million or ten million users is not a large
enough number on its own to decide between NoSQL and an RDBMS.
However, as you stated, the size
of your dataset will only grow. You can start with MongoDB, keeping
scalability in mind.
Now, on to Apache Drill.
Apache Drill was inspired by
Google's Dremel. It performs well when you select only the columns you need.
Multiple data sources can be joined together (e.g. join over Hive and
MongoDB, join over RDBMS and MongoDB, etc.)
Also, pure MongoDB or MongoDB +
Apache Drill are both viable options.
MongoDB
Stick to native MongoDB if your
application architecture is entirely based on MongoDB. You have access to all
of MongoDB's features. MongoDB java driver, python driver, REST API, and other
options are available. Yes, learning MongoDB-specific concepts will take more
time. However, RDBMS queries provide you a lot of flexibility, and you can do a
lot of things over here.
MongoDB + Apache Drill
You can choose this option if you
can accomplish your goal with JPA or SQL queries and you are more familiar with
RDBMS queries.
Additional benefit: in the future you can use
Drill to query across additional data sources such as Hive/HDFS or an RDBMS in
addition to MongoDB.
Ques. 18): What is an example of
a real-time use of Apache Drill? What makes Drill superior to Hive?
Answer:
Hive is a batch processing
framework that is best suited for processes that take a long time to complete.
Drill outperforms Hive when it comes to data exploration and business
intelligence.
Drill is also not limited to
Hadoop. It can, for example, query NoSQL databases (such as MongoDB and HBase)
and cloud storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob Storage,
Swift).
Ques. 19): Is Cloudera Impala
similar to the Apache Drill incubator project?
Answer:
It's difficult to make a fair
comparison because both initiatives are still in the early stages. We still
have a lot of work to do because the Apache Drill project was only started a
few months ago. That said, I believe it is critical to discuss some of the
Apache Drill project's techniques and goals, which are critical to comprehend
when comparing the two:
- Apache Drill is a community-driven product run under the Apache foundation, with all the benefits and guarantees it entails.
- Apache Drill committers are scattered across many different companies.
- Apache Drill is a NoHadoop (not just Hadoop) project with the goal of providing distributed query capabilities across a variety of large data systems, including MongoDB, Cassandra, Riak, and Splunk.
- By supporting all major Hadoop distributions, including Apache, Hortonworks, Cloudera, and MapR, Apache Drill avoids vendor lock-in.
- Apache Drill allows you to do queries on hierarchical data.
- JSON and other schemaless data are supported by Apache Drill.
- The Apache Drill architecture is built to make third-party and custom integrations as simple as possible by clearly specifying interfaces for query languages, query optimizers, storage engines, user-defined functions, user-defined nested data functions, and so on.
Clearly, the Apache Drill project
has a lot to offer and many strengths. These are only achievable
because of the enormous effort and interest that a large number of
companies have begun to contribute to the project, which in turn is only possible because
of the strength of the Apache umbrella.
Ques. 20): Why is MapR mentioning
Apache Drill so much?
Answer:
Drill is a new and interesting
low latency SQL-on-Hadoop solution with more functionality than the other
options available, and MapR has done it in the Apache Foundation so that it,
like Hive, is a real community shared open source project, which means it's
more likely to gain wider adoption.
Drill is MapR's baby, so they're
right to be proud of it - it's the most exciting thing to happen to
SQL-on-Hadoop in years. They're also discussing it since it addresses
real-world problems and advances the field.
Consider Drill to be what Impala
could have been if it had more functionality and was part of the Apache
Foundation.