AWS Glue is a serverless data
integration tool that makes finding, preparing, and combining data for
analytics, machine learning, and application development a breeze. AWS Glue has
all of the data integration features you'll need, so you can start analysing
and using your data in minutes rather than months. To make data integration
easier, AWS Glue offers both visual and code-based interfaces. The AWS Glue
Data Catalog allows users to quickly locate and retrieve data. With a few
clicks in AWS Glue Studio, data engineers and ETL (extract, transform, and
load) developers can graphically construct, run, and monitor ETL workflows.
Data analysts and scientists can use AWS Glue DataBrew to visually enrich,
clean, and standardise data without having to write code. Application
developers may utilise Structured Query Language (SQL) to mix and replicate
data across disparate data stores with AWS Glue Elastic Views.
Ques. 1): What are your thoughts on
AWS Glue?
Answer:
AWS Glue is a service that makes
categorising, cleaning, and reliably moving data across various data stores and
data streams simple and cost-effective.
- It comprises the AWS Glue Data Catalog, a central metadata repository.
- By handling dependency resolution, task monitoring, and
retries, AWS Glue assists in the generation of Python or Scala code.
- AWS Glue is a serverless infrastructure that is easy to
set up and manage, and it has a dynamic frame component that we can use in
our ETL scripts.
- A DynamicFrame is similar to an Apache Spark DataFrame, a data abstraction for organising data into rows and columns, but each record is self-describing, so no upfront schema is required.
Ques. 2): Which Data Stores Can I
Crawl using Glue?
Answer:
Crawlers can crawl the following data stores through a JDBC connection:
- Amazon Redshift
- Amazon Relational Database Service (Amazon RDS)
- Amazon Aurora
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Publicly accessible databases
Ques. 3): What components does AWS
Glue make use of?
Answer:
AWS Glue is made up of the following components:
- Data Catalog is a Metadata Repository on the Cloud.
- The ETL Engine assists with the generation of Python
and Scala code.
- The Flexible Scheduler aids in dependency resolution, job monitoring, and retrying.
- AWS Glue DataBrew provides a visual interface for
normalizing and cleaning data.
- AWS Glue Elastic Views replicates and combines data across various data stores.
Ques. 4): What is AWS Glue DataBrew,
and how does it work?
Answer:
AWS Glue DataBrew is a visual data
preparation solution that allows data analysts and scientists to prepare data
without writing code using an interactive, point-and-click visual interface.
You can easily view, clean, and normalise terabytes, if not petabytes, of data
directly from your data lake, data warehouses, and databases, including Amazon
S3, Amazon Redshift, Amazon Aurora, and Amazon RDS, using Glue DataBrew. AWS Glue DataBrew is generally available in the US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), Asia Pacific (Sydney), and Asia Pacific (Tokyo) regions.
Ques. 5): What steps do I need to
take to get my metadata into the AWS Glue Data Catalog?
Answer:
There are several ways to populate
metadata into the AWS Glue Data Catalog with AWS Glue. Glue crawlers
automatically deduce schemas and partition structure from various data sources
you control, populating the Glue Data Catalog with corresponding table
definitions and statistics. You can also schedule crawlers to run on a regular
basis to keep your metadata current and in sync with the underlying data.
Alternatively, you can use the AWS Glue Console or the API to manually add and
change table details. On an Amazon EMR cluster, you can also run Hive DDL
statements via the Amazon Athena Console or a Hive client. Finally, if you
already have a persistent Apache Hive Metastore, you can perform a bulk import
of that metadata into the AWS Glue Data Catalog by using our import script.
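As a sketch of the API route, the boto3 Glue client's create_table call takes a TableInput structure like the one built below. The database, table, and S3 path names are hypothetical, and the SerDe settings assume comma-delimited CSV data; adjust both for your environment.

```python
def build_table_input(name, s3_path, columns):
    """Build the TableInput structure that glue.create_table expects.
    The SerDe settings here assume comma-delimited CSV data."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": c, "Type": t} for c, t in columns],
            "Location": s3_path,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }

def register_table(database, name, s3_path, columns):
    """Register a table definition in the Glue Data Catalog.
    Requires AWS credentials with glue:CreateTable permission;
    boto3 is imported lazily so the helper above works without it."""
    import boto3
    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database,
                      TableInput=build_table_input(name, s3_path, columns))

# Example (not run here -- needs AWS credentials; all names are invented):
# register_table("sales_db", "orders", "s3://my-bucket/orders/",
#                [("order_id", "bigint"), ("amount", "double")])
```

A crawler would populate the same structure automatically by inferring the columns and partitions from the data itself.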
Ques. 6): What steps does AWS Glue
take to deduplicate my data?
Answer:
The FindMatches ML Transform in AWS
Glue makes it simple to locate and link records that refer to the same entity
but lack a unique identifier. Before FindMatches, data-matching problems were
usually solved deterministically by constructing a large number of hand-tuned
rules. Behind the scenes, FindMatches employs machine learning algorithms to
learn how to match records based on each developer's specific business
criteria. FindMatches first asks you to label example record pairs as matching or not matching, then uses those labels to train an ML Transform that can match the rest of your data.
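FindMatches itself learns a matching model from your labels; as a rough illustration of the problem it solves, here is the kind of hand-tuned deterministic rule it replaces. The field names and similarity threshold are invented for the example.

```python
from difflib import SequenceMatcher

def likely_same_entity(rec_a, rec_b, threshold=0.85):
    """Hand-tuned rule: two records match if their names are very
    similar and their zip codes agree. FindMatches learns rules like
    this from labelled examples instead of having you write them."""
    name_sim = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    return name_sim >= threshold and rec_a["zip"] == rec_b["zip"]

a = {"name": "Jon Smith",  "zip": "98101"}
b = {"name": "John Smith", "zip": "98101"}
c = {"name": "Jane Doe",   "zip": "10001"}
print(likely_same_entity(a, b))  # similar names, same zip -> match
print(likely_same_entity(a, c))  # different entity -> no match
```

Rule sets like this grow brittle as data variety increases, which is exactly the maintenance burden the ML approach removes.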
Ques. 7): To use AWS Glue DataBrew,
do I need to use AWS Glue Data Catalog or AWS Lake Formation?
Answer:
No. You don't need the AWS Glue Data
Catalog or AWS Lake Formation to use AWS Glue DataBrew. If you utilise either
the AWS Glue Data Catalog or AWS Lake Formation, DataBrew users can choose from
a centralised data catalogue of data sets available to them.
Ques. 8): What are the benefits of
using AWS Glue Schema Registry?
Answer:
- Validate schemas using the AWS Glue Schema Registry.
Schemas used for data production are checked against schemas in a central
registry when data streaming apps are linked with AWS Glue Schema
Registry, allowing you to centrally regulate data quality.
- Maintain the evolution of the schema. One of eight
compatibility modes can be used to specify criteria for how schemas can
and cannot grow.
- Improve the quality of your data. Serializers compare
data producers' schemas to those in the registry, enhancing data quality
at the source and avoiding downstream difficulties caused by unexpected
schema drift.
- Save money. Serializers transform data into a binary
format, which can then be compressed before being provided, lowering data
transit and storage costs.
- Increase processing speed. A data stream often contains records with multiple schemas. The Schema Registry lets applications that read data streams selectively process each record based on its schema rather than parsing its contents, improving processing efficiency.
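The Schema Registry's serializers use compact binary encodings rather than plain text, and can compress before sending. As a rough stand-in for that pipeline, compressing a JSON payload shows the kind of wire-size reduction at stake; the record shape and batch size here are invented, and actual savings vary by payload.

```python
import json
import zlib

# A hypothetical batch of streaming records with a repetitive schema.
record = {"customer_id": 0, "event": "page_view", "url": "/products/42"}
batch = [dict(record, customer_id=i) for i in range(1000)]

text = json.dumps(batch).encode("utf-8")  # plain-text wire format
binary = zlib.compress(text)              # compressed binary format

# Repetitive field names compress well, shrinking transit/storage costs.
print(f"plain: {len(text)} bytes, compressed: {len(binary)} bytes")
```

Avro (the Registry's native format) goes further by stripping the field names entirely, since the schema is stored once in the registry rather than repeated in every record.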
Ques. 9): What are the benefits of
using AWS Glue Elastic Views?
Answer:
To aggregate and constantly
replicate data across various data stores in near-real time, you should use AWS
Glue Elastic Views. This is often the case when developing new application
functionality that requires access to data from one or more current data
stores. An organisation might, for example, utilise a customer relationship
management (CRM) programme to keep track of client connections and an
e-commerce website to conduct online transactions. These applications would
store data in one or more data stores. The organisation is now developing a new
bespoke application that generates and presents unique offers to active website
visitors. This programme accomplishes this by combining customer data from the
CRM application with online clickstream data from the e-commerce application. A
developer can create new functionality in three phases using AWS Glue Elastic
Views. First, they use AWS Glue Elastic Views to connect the CRM and e-commerce
application data stores. Then, using SQL, they select the appropriate data from the CRM and e-commerce data stores. Finally, they connect the custom application's data store to the results.
Ques. 10): Which AWS services and
open source projects make use of AWS Glue Data Catalog?
Answer:
The following AWS services and open source projects make use of the AWS Glue Data Catalog:
- AWS Lake Formation
- Amazon Athena
- Amazon Redshift Spectrum
- Amazon EMR
- AWS Glue Data Catalog Client for Apache Hive Metastore
Ques. 11): When should I employ a
Glue Classifier?
Answer:
When you crawl a data store to
define metadata tables in the AWS Glue Data Catalog, you employ classifiers.
You can use an ordered set of classifiers to set up your crawler. When a
crawler calls a classifier, the classifier determines whether or not the data
has been identified. If the first classifier fails to recognise the data or is
unsure, the crawler moves on to the next classifier in the list to see if it
can recognise the data.
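The ordered-classifier behaviour described above can be sketched as follows. The two classifiers here are invented stand-ins for Glue's built-in and custom classifiers, using standard-library parsers to decide whether they recognise a data sample.

```python
import csv
import json

def json_classifier(sample: str):
    """Return a classification if the sample parses as JSON, else None."""
    try:
        json.loads(sample)
        return {"format": "json"}
    except ValueError:
        return None

def csv_classifier(sample: str):
    """Return a classification if the sample looks like delimited text."""
    try:
        dialect = csv.Sniffer().sniff(sample)
        return {"format": "csv", "delimiter": dialect.delimiter}
    except csv.Error:
        return None

def classify(sample, classifiers):
    """Try each classifier in order; the first confident answer wins,
    mirroring how a Glue crawler walks its ordered classifier list."""
    for clf in classifiers:
        result = clf(sample)
        if result is not None:
            return result
    return {"format": "UNKNOWN"}

print(classify('{"a": 1}', [json_classifier, csv_classifier]))
print(classify("a,b\n1,2\n", [json_classifier, csv_classifier]))
```

In Glue, custom classifiers you register run before the built-in ones, so order is how you override the default recognition logic.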
Ques. 12): How can I connect to the AWS Glue Schema Registry in a secure manner?
Answer:
By configuring an interface VPC
endpoint for AWS Glue, you may use AWS PrivateLink to link your data producer's
VPC to AWS Glue. Communication between your VPC and AWS Glue occurs entirely
within the AWS network when you use a VPC interface endpoint.
Ques. 13): When a crawler runs, what
happens?
Answer:
To examine a data store, a crawler performs the following actions:
- Classifies the data, using built-in or custom classifiers, to determine the raw data's format, schema, and associated properties.
- Groups the data into tables or partitions based on crawler heuristics.
- Writes metadata to the Data Catalog; you can configure how the crawler adds, changes, and deletes tables and partitions.
Ques. 14): When should I utilise
Amazon EMR vs. AWS Glue?
Answer:
AWS Glue is a scale-out execution
environment for your data transformation activities that runs on top of the
Apache Spark ecosystem. AWS Glue infers, adapts, and monitors your ETL jobs, making job creation and maintenance easier. Amazon EMR gives you direct
access to your Hadoop environment, allowing you to access it at a lower level
and use tools other than Apache Spark.
Ques. 15): What options do I have
for customising the ETL code provided by AWS Glue?
Answer:
AWS Glue's ETL script recommendation system generates Scala or Python code, built on Glue's custom ETL library to simplify access to data sources and manage job execution. You can write ETL code against AWS Glue's library, write arbitrary Scala or Python code inline in the AWS Glue Console script editor, or download the auto-generated code and edit it in your own IDE.
Ques. 16): How am I charged for AWS
Glue?
Answer:
Beyond the AWS Glue Data Catalog free tier, you pay a simple monthly fee to store and access metadata in the AWS Glue Data Catalog. Crawler runs are charged an hourly rate, billed per second, with a 10-minute minimum. If you use a development endpoint to develop your ETL code interactively, you pay an hourly rate, billed per second, for the time your development endpoint is provisioned, with a 10-minute minimum. In addition, depending on the Glue version you choose, you pay an hourly rate, billed per second, for the ETL job itself, with a 1-minute or 10-minute minimum.
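The per-second billing with a minimum can be worked through numerically. The $0.44 per DPU-hour rate and the DPU counts below are illustrative only; check the current AWS Glue pricing page for your region.

```python
def glue_job_cost(seconds, dpus, rate_per_dpu_hour=0.44, min_seconds=600):
    """Per-second billing with a minimum: you are billed for at least
    `min_seconds` (e.g. the 10-minute minimum for crawler runs)."""
    billed = max(seconds, min_seconds)
    return dpus * rate_per_dpu_hour * billed / 3600

# A 3-minute crawler run on 2 DPUs is billed as 10 minutes:
print(round(glue_job_cost(180, dpus=2), 4))
# A 30-minute job on 10 DPUs is billed for its actual duration:
print(round(glue_job_cost(1800, dpus=10), 4))
```

The minimum only matters for short runs; past the 10-minute mark, cost scales linearly with duration and DPU count.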
Ques. 17): Is it possible to monitor
and troubleshoot AWS Glue ETL operations using the Apache Spark web UI?
Answer:
Yes, you can monitor and debug AWS
Glue ETL processes running on the AWS Glue job system, as well as Apache Spark applications running on AWS Glue development endpoints, using the Apache Spark
web UI. For each job, the Spark UI allows you to verify the following:
- Each Spark stage's event chronology
- The job's directed acyclic graph (DAG).
- SparkSQL query physical and logical plans
- For each job, the underlying Spark environmental
variables
Ques. 18): What happens if AWS Glue
encounters an ETL error?
Answer:
AWS Glue keeps track of job event
metrics and faults and sends all alerts to Amazon CloudWatch. With Amazon
CloudWatch, you can set up a variety of actions to be triggered in response to
certain AWS Glue notifications. You may use an AWS Lambda function to handle an
error or success notification from Glue, for example. The default retry
behaviour in Glue is to retry all failures three times before sending an error
notification.
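Glue's default behaviour (retry a failed job three times before surfacing the error that feeds the CloudWatch notification) looks roughly like this generic retry loop; the flaky job below is invented for illustration.

```python
def run_with_retries(job, max_retries=3):
    """Attempt a job up to 1 + max_retries times, mirroring Glue's
    default of three retries before an error notification is sent."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return job()
        except Exception:
            if attempts > max_retries:
                raise  # retries exhausted: surface the failure

# A hypothetical job that fails twice with a transient error, then succeeds:
calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky_job))  # succeeds on the third attempt
```

In Glue itself the retry count is a job property (MaxRetries) rather than code you write, and each exhausted failure lands in CloudWatch where an alarm or Lambda function can react to it.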
Ques. 19): When a glue crawler
decides to construct partitions, how does it do so?
Answer:
When an AWS Glue crawler scans an Amazon S3 path and finds several folders in a bucket, it determines which folder is the root of a table and which folders are partitions of that table. The table's name is derived from the Amazon S3 prefix (folder name), and you specify an Include path that points to the folder level to crawl. When the majority of schemas at a folder level are similar, the crawler creates partitions of a single table rather than separate tables.
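The folder-level grouping can be sketched as: given S3 keys laid out with Hive-style `col=value` folders, take the top-level prefix as the table name and the folder names as partition columns. The bucket layout below is invented for the example.

```python
def infer_table_and_partitions(keys):
    """Group S3 object keys into tables with partition columns, the
    way a crawler does when schemas at a folder level are similar.
    Assumes Hive-style `col=value` partition folders."""
    tables = {}
    for key in keys:
        parts = key.split("/")
        table = parts[0]  # top-level prefix becomes the table name
        partition = tuple(p.split("=", 1)[0] for p in parts[1:-1] if "=" in p)
        tables.setdefault(table, set()).add(partition)
    return tables

keys = [
    "sales/year=2023/month=01/data.csv",
    "sales/year=2023/month=02/data.csv",
    "sales/year=2024/month=01/data.csv",
]
# One table ("sales") partitioned by (year, month), not three tables:
print(infer_table_and_partitions(keys))
```

The real crawler also compares the schemas of the files under each folder before deciding whether to merge them into one partitioned table, which this sketch omits.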
Ques. 20): AWS Glue Elastic Views
currently supports which sources and targets?
Answer:
As a source, Amazon DynamoDB is supported in the preview, with Amazon Aurora MySQL, Amazon Aurora PostgreSQL, Amazon RDS for MySQL, and Amazon RDS for PostgreSQL to follow. Supported targets are Amazon Redshift, Amazon S3, and Amazon OpenSearch Service, with support for Amazon Aurora MySQL, Amazon Aurora PostgreSQL, Amazon RDS for MySQL, and Amazon RDS for PostgreSQL on the way.