A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data
engineer has set up the necessary AWS Glue connection details and an associated IAM role.
However, when the data engineer attempts to run the AWS Glue job, the data engineer receives an
error message that indicates that there are problems with the Amazon S3 VPC gateway endpoint.
The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket.
Which solution will meet this requirement?
D
Explanation:
The error message indicates that the AWS Glue job cannot access the Amazon S3 bucket through the
VPC endpoint. This could be because the VPC’s route table does not have the necessary routes to
direct the traffic to the endpoint. To fix this, the data engineer must verify that the route table has an
entry for the Amazon S3 service prefix (com.amazonaws.region.s3) with the target as the VPC
endpoint ID. This will allow the AWS Glue job to use the VPC endpoint to access the S3 bucket
without going through the internet or a NAT gateway. For more information, see Gateway endpoints.
Reference:
Troubleshoot the AWS Glue error “VPC S3 endpoint validation failed”
Amazon VPC endpoints for Amazon S3
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
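As a rough illustration of the fix described above, the following boto3 sketch checks a route table for an S3 gateway endpoint route and creates the endpoint if it is missing. The region, VPC ID, and route table ID are placeholders, not values from the question.

```python
import boto3

# Hypothetical identifiers for illustration only.
REGION = "us-east-1"
VPC_ID = "vpc-0123456789abcdef0"
ROUTE_TABLE_ID = "rtb-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name=REGION)

# Check whether the route table already has a route whose target is a gateway
# endpoint (such routes show a vpce- ID as the gateway).
routes = ec2.describe_route_tables(RouteTableIds=[ROUTE_TABLE_ID])["RouteTables"][0]["Routes"]
has_s3_endpoint_route = any(r.get("GatewayId", "").startswith("vpce-") for r in routes)

if not has_s3_endpoint_route:
    # Create the S3 gateway endpoint and associate it with the route table;
    # this adds the com.amazonaws.<region>.s3 prefix-list route automatically.
    ec2.create_vpc_endpoint(
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.{REGION}.s3",
        VpcEndpointType="Gateway",
        RouteTableIds=[ROUTE_TABLE_ID],
    )
```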
A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries
use the data hub to support company-wide analytics. A governance team must ensure that the
company's data analysts can access data only for customers who are within the same country as the
analysts.
Which solution will meet these requirements with the LEAST operational effort?
B
Explanation:
AWS Lake Formation is a service that allows you to easily set up, secure, and manage data lakes. One
of the features of Lake Formation is row-level security, which enables you to control access to specific
rows or columns of data based on the identity or role of the user. This feature is useful for scenarios
where you need to restrict access to sensitive or regulated data, such as customer data from different
countries. By registering the S3 bucket as a data lake location in Lake Formation, you can use the Lake
Formation console or APIs to define and apply row-level security policies to the data in the bucket.
You can also use Lake Formation blueprints to automate the ingestion and transformation of data
from various sources into the data lake. This solution requires the least operational effort compared
to the other options, as it does not involve creating or moving data, or managing multiple tables,
views, or roles. Reference:
AWS Lake Formation
Row-Level Security
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 4: Data Lakes and Data Warehouses, Section 4.2: AWS Lake Formation
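The following sketch shows one way Lake Formation row-level security can be expressed with boto3: a data cells filter whose row filter limits a registered table to a single country. The account ID, database, table, and country value are hypothetical; in practice the filter would then be granted to the country-specific analyst principals instead of the full table.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Hypothetical catalog, database, and table names for illustration.
CATALOG_ID = "111122223333"

# A row-level filter that limits the registered table to one country. Granting
# this filter (rather than the whole table) to a country-specific analyst role
# restricts which rows that role can read through Athena, Redshift Spectrum, etc.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": CATALOG_ID,
        "DatabaseName": "customer_hub",
        "TableName": "customers",
        "Name": "customers_germany_only",
        "RowFilter": {"FilterExpression": "country = 'DE'"},
        "ColumnWildcard": {},  # all columns; use ColumnNames to restrict columns as well
    }
)
```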
A media company wants to improve a system that recommends media content to customers based on
user behavior and preferences. To improve the recommendation system, the company needs to
incorporate insights from third-party datasets into the company's existing analytics platform.
The company wants to minimize the effort and time required to incorporate third-party datasets.
Which solution will meet these requirements with the LEAST operational overhead?
A
Explanation:
AWS Data Exchange is a service that makes it easy to find, subscribe to, and use third-party data in
the cloud. It provides a secure and reliable way to access and integrate data from various sources,
such as data providers, public datasets, or AWS services. Using AWS Data Exchange, you can browse
and subscribe to data products that suit your needs, and then use API calls or the AWS Management
Console to export the data to Amazon S3, where you can use it with your existing analytics platform.
This solution minimizes the effort and time required to incorporate third-party datasets, as you do
not need to set up and manage data pipelines, storage, or access controls. You also benefit from the
data quality and freshness provided by the data providers, who can update their data products as
frequently as needed.
The other options are not optimal for the following reasons:
B . Use API calls to access and integrate third-party datasets from AWS. This option is vague and does
not specify which AWS service or feature is used to access and integrate third-party datasets. AWS
offers a variety of services and features that can help with data ingestion, processing, and analysis,
but not all of them are suitable for the given scenario. For example, AWS Glue is a serverless data
integration service that can help you discover, prepare, and combine data from various sources, but
it requires you to create and run data extraction, transformation, and loading (ETL) jobs, which can
add operational overhead.
C . Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS
CodeCommit repositories. This option is not feasible, as AWS CodeCommit is a source control service
that hosts secure Git-based repositories, not a data source that can be accessed by Amazon Kinesis
Data Streams. Amazon Kinesis Data Streams is a service that enables you to capture, process, and
analyze data streams in real time, such as clickstream data, application logs, or IoT telemetry. It does
not support accessing and integrating data from AWS CodeCommit repositories, which are meant for
storing and managing code, not data.
D . Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon
Elastic Container Registry (Amazon ECR). This option is also not feasible, as Amazon ECR is a fully
managed container registry service that stores, manages, and deploys container images, not a data
source that can be accessed by Amazon Kinesis Data Streams. Amazon Kinesis Data Streams does not
support accessing and integrating data from Amazon ECR, which is meant for storing and managing
container images, not data.
Reference:
AWS Data Exchange User Guide
AWS Data Exchange FAQs
AWS Glue Developer Guide
AWS CodeCommit User Guide
Amazon Kinesis Data Streams Developer Guide
Amazon Elastic Container Registry User Guide
Build a Continuous Delivery Pipeline for Your Container Images with Amazon ECR as Source
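As a hedged sketch of the AWS Data Exchange export flow described above, the following boto3 calls create and start a job that exports a subscribed revision to an S3 bucket, where the existing analytics platform can pick it up. The data set ID, revision ID, and bucket name are placeholders.

```python
import boto3

dx = boto3.client("dataexchange")

# Hypothetical identifiers from a subscribed AWS Data Exchange product.
DATA_SET_ID = "aae4c2cd145a48454f9369d4a4db5c66"
REVISION_ID = "e5a0b6e5c2d94a7f9f9d1b6f0c8e1234"

# Create and start a job that exports the revision's assets to S3.
job = dx.create_job(
    Type="EXPORT_REVISIONS_TO_S3",
    Details={
        "ExportRevisionsToS3": {
            "DataSetId": DATA_SET_ID,
            "RevisionDestinations": [
                {
                    "RevisionId": REVISION_ID,
                    "Bucket": "analytics-landing-bucket",   # placeholder bucket
                    "KeyPattern": "${Asset.Name}",
                }
            ],
        }
    },
)
dx.start_job(JobId=job["Id"])
```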
A financial company wants to implement a data mesh. The data mesh must support centralized data
governance, data analysis, and data access control. The company has decided to use AWS Glue for
data catalogs and extract, transform, and load (ETL) operations.
Which combination of AWS services will implement a data mesh? (Choose two.)
B E
Explanation:
A data mesh is an architectural framework that organizes data into domains and treats data as
products that are owned and offered for consumption by different teams. A data mesh requires a
centralized layer for data governance and access control, as well as a distributed layer for data
storage and analysis. AWS Glue can provide data catalogs and ETL operations for the data mesh, but
it cannot provide data governance and access control by itself. Therefore, the company needs to
use another AWS service for this purpose. AWS Lake Formation is a service that allows you to create,
secure, and manage data lakes on AWS. It integrates with AWS Glue and other AWS services to
provide centralized data governance and access control for the data mesh. Therefore, option E is
correct.
For data storage and analysis, the company can choose from different AWS services depending on
their needs and preferences. However, one of the benefits of a data mesh is that it enables data to be
stored and processed in a decoupled and scalable way. Therefore, using serverless or managed
services that can handle large volumes and varieties of data is preferable. Amazon S3 is a highly
scalable, durable, and secure object storage service that can store any type of data. Amazon Athena
is a serverless interactive query service that can analyze data in Amazon S3 using standard SQL.
Therefore, option B is a good choice for data storage and analysis in a data mesh. Options A, C, and D
are not optimal because they either use relational databases that are not suitable for storing diverse
and unstructured data, or they require more management and provisioning than serverless
services. Reference:
What is a Data Mesh? - Data Mesh Architecture Explained - AWS
AWS Glue - Developer Guide
AWS Lake Formation - Features
Design a data mesh architecture using AWS Lake Formation and AWS Glue
Amazon S3 - Features
Amazon Athena - Features
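A minimal sketch of how the two layers fit together, assuming hypothetical domain names, role ARNs, and S3 locations: Lake Formation grants a consumer role access to a producer's Glue Data Catalog table (centralized governance), and the consumer then analyzes the shared data in place with Athena (distributed storage and analysis).

```python
import boto3

lakeformation = boto3.client("lakeformation")
athena = boto3.client("athena")

# Centralized governance layer: grant a consumer domain's role SELECT on a
# producer domain's Glue Data Catalog table through Lake Formation.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/marketing-analyst"},
    Resource={"Table": {"DatabaseName": "sales_domain", "Name": "orders"}},
    Permissions=["SELECT"],
)

# Distributed analysis layer: the consumer queries the shared table in place with Athena.
athena.start_query_execution(
    QueryString="SELECT order_date, COUNT(*) FROM sales_domain.orders GROUP BY order_date",
    QueryExecutionContext={"Database": "sales_domain"},
    ResultConfiguration={"OutputLocation": "s3://marketing-query-results/athena/"},
)
```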
A data engineer maintains custom Python scripts that perform a data formatting process that many
AWS Lambda functions use. When the data engineer needs to modify the Python scripts, the data
engineer must manually update all the Lambda functions.
The data engineer requires a less manual way to update the Lambda functions.
Which solution will meet this requirement?
B
Explanation:
Lambda layers are a way to share code and dependencies across multiple Lambda functions. By
packaging the custom Python scripts into Lambda layers, the data engineer can update the scripts in
one place and have them automatically applied to all the Lambda functions that use the layer. This
reduces the manual effort and ensures consistency across the Lambda functions. The other options
are either not feasible or not efficient. Storing a pointer to the custom Python scripts in the execution
context object or in environment variables would require the Lambda functions to download the
scripts from Amazon S3 every time they are invoked, which would increase latency and cost.
Assigning the same alias to each Lambda function would not help with updating the Python scripts,
as the alias only points to a specific version of the Lambda function code. Reference:
AWS Lambda layers
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 3: Data Ingestion and Transformation, Section 3.4: AWS Lambda
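A minimal boto3 sketch of the layer approach: publish the shared formatting scripts as a new layer version, then point each consuming function at that version. The layer name, zip file, and function names are hypothetical.

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish a new layer version that contains the shared formatting scripts
# (the zip must place the modules under python/ so Lambda adds them to sys.path).
with open("formatting_layer.zip", "rb") as f:
    layer = lambda_client.publish_layer_version(
        LayerName="data-formatting",
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.12"],
    )

# Point each consuming function at the new layer version; the function code
# itself does not change, it simply imports the shared module from the layer.
for function_name in ["ingest-orders", "ingest-returns"]:  # hypothetical function names
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer["LayerVersionArn"]],
    )
```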
A company created an extract, transform, and load (ETL) data pipeline in AWS Glue. A data engineer
must crawl a table that is in Microsoft SQL Server. The data engineer needs to extract, transform, and
load the output of the crawl to an Amazon S3 bucket. The data engineer also must orchestrate the
data pipeline.
Which AWS service or feature will meet these requirements MOST cost-effectively?
B
Explanation:
AWS Glue workflows are a cost-effective way to orchestrate complex ETL jobs that involve multiple
crawlers, jobs, and triggers. AWS Glue workflows allow you to visually monitor the progress and
dependencies of your ETL tasks, and automatically handle errors and retries. AWS Glue workflows
also integrate with other AWS services, such as Amazon S3, Amazon Redshift, and AWS Lambda,
among others, enabling you to leverage these services for your data processing workflows. AWS Glue
workflows are serverless, meaning you only pay for the resources you use, and you don’t have to
manage any infrastructure.
AWS Step Functions, AWS Glue Studio, and Amazon MWAA are also possible options for
orchestrating ETL pipelines, but they have some drawbacks compared to AWS Glue workflows. AWS
Step Functions is a serverless function orchestrator that can handle different types of data
processing, such as real-time, batch, and stream processing. However, AWS Step Functions requires
you to write code to define your state machines, which can be complex and error-prone. AWS Step
Functions also charges you for every state transition, which can add up quickly for large-scale ETL
pipelines.
AWS Glue Studio is a graphical interface that allows you to create and run AWS Glue ETL jobs without
writing code. AWS Glue Studio simplifies the process of building, debugging, and monitoring your
ETL jobs, and provides a range of pre-built transformations and connectors. However, AWS Glue
Studio does not support workflows, meaning you cannot orchestrate multiple ETL jobs or crawlers
with dependencies and triggers. AWS Glue Studio also does not support streaming data sources or
targets, which limits its use cases for real-time data processing.
Amazon MWAA is a fully managed service that makes it easy to run open-source versions of Apache
Airflow on AWS and build workflows to run your ETL jobs and data pipelines. Amazon MWAA
provides a familiar and flexible environment for data engineers who are familiar with Apache Airflow,
and integrates with a range of AWS services such as Amazon EMR, AWS Glue, and AWS Step
Functions. However, Amazon MWAA is not serverless, meaning you have to provision and pay for the
resources you need, regardless of your usage. Amazon MWAA also requires you to write code to
define your DAGs, which can be challenging and time-consuming for complex ETL
pipelines. Reference:
AWS Glue Workflows
AWS Step Functions
AWS Glue Studio
Amazon MWAA
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
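The following boto3 sketch outlines a Glue workflow for this scenario: a scheduled trigger starts the SQL Server crawler, and a conditional trigger starts the ETL job once the crawler succeeds. The workflow, crawler, and job names are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical workflow: crawl the SQL Server table, then run the ETL job to S3.
glue.create_workflow(Name="sqlserver-to-s3")

# Scheduled trigger: start the crawler every night.
glue.create_trigger(
    Name="start-crawl",
    WorkflowName="sqlserver-to-s3",
    Type="SCHEDULED",
    Schedule="cron(0 1 * * ? *)",
    Actions=[{"CrawlerName": "sqlserver-orders-crawler"}],
    StartOnCreation=True,
)

# Conditional trigger: run the ETL job only after the crawler succeeds.
glue.create_trigger(
    Name="run-etl-after-crawl",
    WorkflowName="sqlserver-to-s3",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "sqlserver-orders-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "orders-to-s3-etl"}],
    StartOnCreation=True,
)
```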
A financial services company stores financial data in Amazon Redshift. A data engineer wants to run
real-time queries on the financial data to support a web-based trading application. The data engineer
wants to run the queries from within the trading application.
Which solution will meet these requirements with the LEAST operational overhead?
B
Explanation:
The Amazon Redshift Data API is a built-in feature that allows you to run SQL queries on Amazon
Redshift data with web services-based applications, such as AWS Lambda, Amazon SageMaker
notebooks, and AWS Cloud9. The Data API does not require a persistent connection to your
database, and it provides a secure HTTP endpoint and integration with AWS SDKs. You can use the
endpoint to run SQL statements without managing connections. The Data API also supports both
Amazon Redshift provisioned clusters and Redshift Serverless workgroups. The Data API is the best
solution for running real-time queries on the financial data from within the trading application, as it
has the least operational overhead compared to the other options.
Option A is not the best solution, as establishing WebSocket connections to Amazon Redshift would
require more configuration and maintenance than using the Data API. WebSocket connections are
also not supported by Amazon Redshift clusters or serverless workgroups.
Option C is not the best solution, as setting up JDBC connections to Amazon Redshift would also
require more configuration and maintenance than using the Data API. JDBC connections require the
application to manage drivers, persistent connections, and credentials, which adds operational overhead.
Option D is not the best solution, as storing frequently accessed data in Amazon S3 and using
Amazon S3 Select to run the queries would introduce additional latency and complexity than using
the Data API. Amazon S3 Select is also not optimized for real-time queries, as it scans the entire
object before returning the results. Reference:
Using the Amazon Redshift Data API
Calling the Data API
Amazon Redshift Data API Reference
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
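A minimal sketch of calling the Data API from application code, assuming a hypothetical Redshift Serverless workgroup and database. The statement is submitted over HTTPS, polled for completion, and the results are fetched without any persistent connection.

```python
import time
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical workgroup and database names for a Redshift Serverless setup.
statement = redshift_data.execute_statement(
    WorkgroupName="trading",
    Database="markets",
    Sql="SELECT symbol, price FROM quotes ORDER BY quote_time DESC LIMIT 10",
)

# Poll for completion; no JDBC driver or persistent connection is needed.
while True:
    status = redshift_data.describe_statement(Id=statement["Id"])
    if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(0.2)

if status["Status"] == "FINISHED":
    rows = redshift_data.get_statement_result(Id=statement["Id"])["Records"]
```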
A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The
company has several use cases. The company must implement permission controls to separate query
processes and access to query history among users, teams, and applications that are in the same
AWS account.
Which solution will meet these requirements?
B
Explanation:
Athena workgroups are a way to isolate query execution and query history among users, teams, and
applications that share the same AWS account. By creating a workgroup for each use case, the
company can control the access and actions on the workgroup resource using resource-level IAM
permissions or identity-based IAM policies. The company can also use tags to organize and identify
the workgroups, and use them as conditions in the IAM policies to grant or deny permissions to the
workgroup. This solution meets the requirements of separating query processes and access to query
history among users, teams, and applications that are in the same AWS account. Reference:
Athena Workgroups
IAM policies for accessing workgroups
Workgroup example policies
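As an illustration, the following sketch creates a per-use-case workgroup and prints an identity-based IAM policy that scopes a team's query actions to that workgroup's ARN. The workgroup name, result location, account ID, and region are hypothetical.

```python
import json
import boto3

athena = boto3.client("athena")

# One workgroup per use case: each workgroup keeps its own query history
# and writes results to its own S3 prefix.
athena.create_work_group(
    Name="marketing-adhoc",
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://athena-results/marketing-adhoc/"}
    },
    Tags=[{"Key": "team", "Value": "marketing"}],
)

# Identity-based policy (hypothetical account and region) that limits a team's
# principals to running and inspecting queries only in their own workgroup.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
                "athena:ListQueryExecutions",
            ],
            "Resource": "arn:aws:athena:us-east-1:111122223333:workgroup/marketing-adhoc",
        }
    ],
}
print(json.dumps(policy, indent=2))
```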
A data engineer needs to schedule a workflow that runs a set of AWS Glue jobs every day. The data
engineer does not require the Glue jobs to run or finish at a specific time.
Which solution will run the Glue jobs in the MOST cost-effective way?
A
Explanation:
The FLEX execution class allows you to run AWS Glue jobs on spare compute capacity instead of
dedicated hardware. This can reduce the cost of running non-urgent or non-time sensitive data
integration workloads, such as testing and one-time data loads. The FLEX execution class is available
for AWS Glue 3.0 Spark jobs. The other options are not as cost-effective as FLEX, because they either
use dedicated resources (STANDARD) or do not affect the cost at all (Spot Instance type and
GlueVersion). Reference:
Introducing AWS Glue Flex jobs: Cost savings on ETL workloads
Serverless Data Integration – AWS Glue Pricing
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide (Chapter 5, page 125)
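A minimal sketch of selecting the FLEX execution class for an existing Glue Spark job run with boto3; the job name is hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Run an existing Glue Spark job on spare capacity; FLEX is the execution class
# intended for non-urgent workloads such as these daily jobs.
glue.start_job_run(
    JobName="daily-load-orders",   # hypothetical job name
    ExecutionClass="FLEX",
)
```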
A data engineer needs to create an AWS Lambda function that converts the format of data from .csv
to Apache Parquet. The Lambda function must run only if a user uploads a .csv file to an Amazon S3
bucket.
Which solution will meet these requirements with the LEAST operational overhead?
A
Explanation:
Option A is the correct answer because it meets the requirements with the least operational
overhead. Creating an S3 event notification that has an event type of s3:ObjectCreated:* will trigger
the Lambda function whenever a new object is created in the S3 bucket. Using a filter rule to
generate notifications only when the suffix includes .csv will ensure that the Lambda function only
runs for .csv files. Setting the ARN of the Lambda function as the destination for the event
notification will directly invoke the Lambda function without any additional steps.
Option B is incorrect because it requires the user to tag the objects with .csv, which adds an extra
step and increases the operational overhead.
Option C is incorrect because it uses an event type of s3:*, which will trigger the Lambda function for
any S3 event, not just object creation. This could result in unnecessary invocations and increased
costs.
Option D is incorrect because it involves creating and subscribing to an SNS topic, which adds an
extra layer of complexity and operational overhead.
Reference:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 3: Data Ingestion and Transformation, Section 3.2: S3 Event Notifications and Lambda Functions, Pages 67-69
Building Batch Data Analytics Solutions on AWS, Module 4: Data Transformation, Lesson 4.2: AWS Lambda, Pages 4-8
AWS Documentation Overview, AWS Lambda Developer Guide, Working with AWS Lambda Functions, Configuring Function Triggers, Using AWS Lambda with Amazon S3, Pages 1-5
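A boto3 sketch of the event notification described in option A, with a suffix filter of .csv and the Lambda function ARN as the destination. The bucket name and function ARN are placeholders, and the function's resource policy must separately allow S3 to invoke it.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and function ARN. The Lambda resource policy must also
# permit s3.amazonaws.com to invoke the function (lambda add-permission).
s3.put_bucket_notification_configuration(
    Bucket="incoming-csv-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:csv-to-parquet",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}
                },
            }
        ]
    },
)
```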
A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the
files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also
notices that users perform most queries by selecting a specific column.
Which solution will MOST speed up the Athena query performance?
C
Explanation:
Amazon Athena is a serverless interactive query service that allows you to analyze data in Amazon
S3 using standard SQL. Athena supports various data formats, such as CSV, JSON, ORC, Avro, and
Parquet. However, not all data formats are equally efficient for querying. Some data formats, such as
CSV and JSON, are row-oriented, meaning that they store data as a sequence of records, each with
the same fields. Row-oriented formats are suitable for loading and exporting data, but they are not
optimal for analytical queries that often access only a subset of columns. Row-oriented formats also
do not support compression or encoding techniques that can reduce the data size and improve the
query performance.
On the other hand, some data formats, such as ORC and Parquet, are column-oriented, meaning that
they store data as a collection of columns, each with a specific data type. Column-oriented formats
are ideal for analytical queries that often filter, aggregate, or join data by columns. Column-oriented
formats also support compression and encoding techniques that can reduce the data size and
improve the query performance. For example, Parquet supports dictionary encoding, which replaces
repeated values with numeric codes, and run-length encoding, which replaces consecutive identical
values with a single value and a count. Parquet also supports various compression algorithms, such
as Snappy, GZIP, and ZSTD, that can further reduce the data size and improve the query performance.
Therefore, changing the data format from CSV to Parquet and applying Snappy compression will
most speed up the Athena query performance. Parquet is a column-oriented format that allows
Athena to scan only the relevant columns and skip the rest, reducing the amount of data read from
S3. Snappy is a compression algorithm that reduces the data size without compromising the query
speed, as it is splittable and does not require decompression before reading. This solution will also
reduce the cost of Athena queries, as Athena charges based on the amount of data scanned from S3.
The other options are not as effective as changing the data format to Parquet and applying Snappy
compression. Changing the data format from CSV to JSON and applying Snappy compression will not
improve the query performance significantly, as JSON is also a row-oriented format that does not
support columnar access or encoding techniques. Compressing the CSV files by using Snappy
compression will reduce the data size, but it will not improve the query performance significantly, as
CSV is still a row-oriented format that does not support columnar access or encoding techniques.
Compressing the CSV files by using gzip compression will reduce the data size, but it will degrade the
query performance, as gzip is not a splittable compression algorithm and requires decompression
before reading. Reference:
Amazon Athena
Choosing the Right Data Format
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 5: Data Analysis and Visualization, Section 5.1: Amazon Athena
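As a small illustration of the format change, the following pandas sketch rewrites one CSV object as Snappy-compressed Parquet (it assumes pyarrow and s3fs are installed, and the paths are placeholders). At scale the same conversion would normally be done with an AWS Glue job or an Athena CTAS statement.

```python
import pandas as pd

# Convert one file from row-oriented CSV to column-oriented Parquet with Snappy
# compression. Paths are placeholders; reading/writing s3:// paths requires s3fs.
df = pd.read_csv("s3://raw-bucket/trades/2024-01-01.csv")
df.to_parquet(
    "s3://curated-bucket/trades/2024-01-01.parquet",
    engine="pyarrow",
    compression="snappy",
)
```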
A manufacturing company collects sensor data from its factory floor to monitor and enhance
operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the
sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3
bucket.
The company needs to display a real-time view of operational efficiency on a large screen in the
manufacturing facility.
Which solution will meet these requirements with the LOWEST latency?
C
Explanation:
This solution will meet the requirements with the lowest latency because it uses Amazon Managed
Service for Apache Flink to process the sensor data in real time and write it to Amazon Timestream, a
fast, scalable, and serverless time series database. Amazon Timestream is optimized for storing and
analyzing time series data, such as sensor data, and can handle trillions of events per day with
millisecond latency. By using Amazon Timestream as a source, you can create an Amazon QuickSight
dashboard that displays a real-time view of operational efficiency on a large screen in the
manufacturing facility. Amazon QuickSight is a fully managed business intelligence service that can
connect to various data sources, including Amazon Timestream, and provide interactive
visualizations and insights.
The other options are not optimal for the following reasons:
A . Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data
Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon
Timestream database. Use the Timestream database as a source to create a Grafana dashboard. This
option is similar to option C, but it uses Grafana instead of Amazon QuickSight to create the
dashboard. Grafana is an open source visualization tool that can also connect to Amazon
Timestream, but it requires additional steps to set up and configure, such as deploying a Grafana
server on Amazon EC2, installing the Amazon Timestream plugin, and creating an IAM role for
Grafana to access Timestream. These steps can increase the latency and complexity of the solution.
B . Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is
created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to
create an Amazon QuickSight dashboard. This option is not suitable for displaying a real-time view of
operational efficiency, as it introduces unnecessary delays and costs in the data pipeline. First, the
sensor data is written to an S3 bucket by Amazon Kinesis Data Firehose, which can have a buffering
interval of up to 900 seconds. Then, the S3 bucket sends a notification to a Lambda function, which
can incur additional invocation and execution time. Finally, the Lambda function publishes the data
to Amazon Aurora, a relational database that is not optimized for time series data and can have
higher storage and performance costs than Amazon Timestream.
D . Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to
an Amazon Timestream database. Use the Timestream database as a source to create a Grafana
dashboard. This option is also not suitable for displaying a real-time view of operational efficiency, as
it uses AWS Glue bookmarks to read sensor data from the S3 bucket. AWS Glue bookmarks are a
feature that helps AWS Glue jobs and crawlers keep track of the data that has already been
processed, so that they can resume from where they left off. However, AWS Glue jobs and crawlers
are not designed for real-time data processing, as they can have a minimum frequency of 5 minutes
and a variable start-up time. Moreover, this option also uses Grafana instead of Amazon QuickSight
to create the dashboard, which can increase the latency and complexity of the solution.
Reference:
Amazon Managed Service for Apache Flink
Amazon Timestream
Amazon QuickSight
Analyze data in Amazon Timestream using Grafana
Amazon Kinesis Data Firehose
Amazon Aurora
AWS Glue Bookmarks
AWS Glue Job and Crawler Scheduling
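A minimal sketch of the write side of option C: a producer (for example, a Flink sink or a test script) writes one sensor reading to an Amazon Timestream table, which QuickSight can then query for the dashboard. The database, table, dimension, and measure names are hypothetical.

```python
import time
import boto3

timestream = boto3.client("timestream-write")

# Hypothetical database and table. A Flink sink (or any producer) writes one
# record per sensor reading; QuickSight then queries the table for the dashboard.
timestream.write_records(
    DatabaseName="factory",
    TableName="efficiency",
    Records=[
        {
            "Dimensions": [{"Name": "line", "Value": "assembly-3"}],
            "MeasureName": "units_per_minute",
            "MeasureValue": "42.5",
            "MeasureValueType": "DOUBLE",
            "Time": str(int(time.time() * 1000)),
            "TimeUnit": "MILLISECONDS",
        }
    ],
)
```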
A company stores daily records of the financial performance of investment portfolios in .csv format in
an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data.
The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog.
Which solution will meet these requirements?
B
Explanation:
To make the S3 data accessible daily in the AWS Glue Data Catalog, the data engineer needs to create
a crawler that can crawl the S3 data and write the metadata to the Data Catalog. The crawler also
needs to run on a daily schedule to keep the Data Catalog updated with the latest data. Therefore,
the solution must include the following steps:
Create an IAM role that has the necessary permissions to access the S3 data and the Data Catalog.
The AWSGlueServiceRole policy is a managed policy that grants these permissions. Associate the role
with the crawler.
Specify the S3 bucket path of the source data as the crawler’s data store. The crawler will scan the
data and infer the schema and format.
Create a daily schedule to run the crawler. The crawler will run at the specified time every day and
update the Data Catalog with any changes in the data.
Specify a database name for the output. The crawler will create or update a table in the Data Catalog
under the specified database. The table will contain the metadata about the data in the S3 bucket,
such as the location, schema, and classification.
Option B is the only solution that includes all these steps. Therefore, option B is the correct answer.
Option A is incorrect because it configures the output destination to a new path in the existing S3
bucket. This is unnecessary and may cause confusion, as the crawler does not write any data to the
S3 bucket, only metadata to the Data Catalog.
Option C is incorrect because it allocates data processing units (DPUs) to run the crawler every day.
This is also unnecessary, as DPUs are only used for AWS Glue ETL jobs, not crawlers.
Option D is incorrect because it combines the errors of option A and C. It configures the output
destination to a new path in the existing S3 bucket and allocates DPUs to run the crawler every day,
both of which are irrelevant for the crawler.
Reference:
AWS managed (predefined) policies for AWS Glue - AWS Glue
Data Catalog and crawlers in AWS Glue - AWS Glue
Scheduling an AWS Glue crawler - AWS Glue
Parameters set on Data Catalog tables by crawler - AWS Glue
AWS Glue pricing - Amazon Web Services (AWS)
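A boto3 sketch of the crawler described in option B, assuming hypothetical role, bucket, database, and crawler names. The Schedule expression runs the crawler once per day.

```python
import boto3

glue = boto3.client("glue")

# Daily crawler over the .csv records; the role, paths, and schedule are
# placeholders. The crawler writes table metadata to the "portfolio_db" database.
glue.create_crawler(
    Name="daily-portfolio-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",  # AWSGlueServiceRole + S3 read
    DatabaseName="portfolio_db",
    Targets={"S3Targets": [{"Path": "s3://portfolio-records/daily/"}]},
    Schedule="cron(0 3 * * ? *)",  # every day at 03:00 UTC
)
glue.start_crawler(Name="daily-portfolio-crawler")
```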
A company loads transaction data for each day into Amazon Redshift tables at the end of each day.
The company wants to have the ability to track which tables have been loaded and which tables still
need to be loaded.
A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table.
The data engineer creates an AWS Lambda function to publish the details of the load statuses to
DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB
table?
C
Explanation:
The Amazon Redshift Data API enables you to interact with your Amazon Redshift data warehouse in
an easy and secure way. You can use the Data API to run SQL commands, such as loading data into
tables, without requiring a persistent connection to the cluster. The Data API also integrates with
Amazon EventBridge, which allows you to monitor the execution status of your SQL commands and
trigger actions based on events. By using the Data API to publish an event to EventBridge, the data
engineer can invoke the Lambda function that writes the load statuses to the DynamoDB table. This
solution is scalable, reliable, and cost-effective. The other options are either not possible or not
optimal. You cannot use a second Lambda function to invoke the first Lambda function based on
CloudWatch or CloudTrail events, as these services do not capture the load status of Redshift tables.
You can use the Data API to publish a message to an SQS queue, but this would require additional
configuration and polling logic to invoke the Lambda function from the queue. This would also
introduce additional latency and cost. Reference:
Using the Amazon Redshift Data API
Using Amazon EventBridge with Amazon Redshift
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 2: Data Store Management, Section 2.2: Amazon Redshift
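A hedged sketch of the flow in option C. The first part runs the load through the Data API with WithEvent=True so that a statement status-change event is published to EventBridge; the second part is the Lambda handler the EventBridge rule would invoke, which writes the status to DynamoDB. The cluster, database, role, and table names, and the event detail field names, are assumptions.

```python
import boto3

redshift_data = boto3.client("redshift-data")
dynamodb = boto3.resource("dynamodb")

# Part 1 (load process): run the load and ask the Data API to emit an
# EventBridge event when the statement finishes.
redshift_data.execute_statement(
    ClusterIdentifier="transactions-cluster",   # hypothetical cluster
    Database="sales",
    DbUser="loader",
    Sql="COPY daily_transactions FROM 's3://loads/2024-01-01/' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole' FORMAT AS PARQUET",
    StatementName="load-daily_transactions",
    WithEvent=True,  # publishes a statement status-change event to EventBridge
)

# Part 2 (Lambda function): handler invoked by the EventBridge rule that matches
# the Data API status-change events; it records the load status in DynamoDB.
# The detail field names shown here are assumptions about the event shape.
def handler(event, context):
    detail = event["detail"]
    dynamodb.Table("redshift_load_status").put_item(
        Item={
            "statement_name": detail.get("statementName", "unknown"),
            "state": detail.get("state", "unknown"),
        }
    )
```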
A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an
Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be
regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data
engineer needs to automate the transfer process and must schedule the process to run periodically.
Which AWS service should the data engineer use to transfer the data in the MOST operationally
efficient way?
A
Explanation:
AWS DataSync is an online data movement and discovery service that simplifies and accelerates data
migrations to AWS as well as moving data to and from on-premises storage, edge locations, other
cloud providers, and AWS Storage services. AWS DataSync can copy data to and from various
sources and targets, including Amazon S3, and handle files in multiple formats. AWS DataSync also
supports incremental transfers, meaning it can detect and copy only the changes to the data,
reducing the amount of data transferred and improving the performance. AWS DataSync can
automate and schedule the transfer process using task schedules, and monitor the progress and
status of the transfers using CloudWatch metrics and events.
AWS DataSync is the most operationally efficient way to transfer the data in this scenario, as it meets
all the requirements and offers a serverless and scalable solution. AWS Glue, AWS Direct Connect,
and Amazon S3 Transfer Acceleration are not the best options for this scenario, as they have some
limitations or drawbacks compared to AWS DataSync.
AWS Glue is a serverless ETL service that can
extract, transform, and load data from various sources to various targets, including Amazon S3.
However, AWS Glue is not designed for large-scale data transfers, as it has some quotas and
limits on the number and size of files it can process. AWS Glue also does not support incremental
transfers, meaning it would have to copy the entire data set every time, which would be inefficient
and costly.
AWS Direct Connect is a service that establishes a dedicated network connection between your on-
premises data center and AWS, bypassing the public internet and improving the bandwidth and
performance of the data transfer. However, AWS Direct Connect is not a data transfer service by
itself, as it requires additional services or tools to copy the data, such as AWS DataSync, AWS Storage
Gateway, or AWS CLI. AWS Direct Connect also has some hardware and location requirements, and
charges you for the port hours and data transfer out of AWS.
Amazon S3 Transfer Acceleration is a feature that enables faster data transfers to Amazon S3 over
long distances, using the AWS edge locations and optimized network paths. However, Amazon S3
Transfer Acceleration is not a data transfer service by itself, as it requires additional services or tools
to copy the data, such as AWS CLI, AWS SDK, or third-party software. Amazon S3 Transfer
Acceleration also charges you for the data transferred over the accelerated endpoints, and does not
guarantee a performance improvement for every transfer, as it depends on various factors such as
the network conditions, the distance, and the object size. Reference:
AWS DataSync
AWS Glue
AWS Glue quotas and limits
AWS Direct Connect
Data transfer options for AWS Direct Connect
Amazon S3 Transfer Acceleration
Using Amazon S3 Transfer Acceleration
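A boto3 sketch of a scheduled AWS DataSync task for this scenario. The source and destination location ARNs (an on-premises share behind a DataSync agent and the S3 bucket location) and the cron expression are placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# Hypothetical location ARNs: the source is an on-premises NFS/SMB share reached
# through a DataSync agent, and the destination is the S3 bucket location.
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:111122223333:location/loc-src0123456789abc",
    DestinationLocationArn="arn:aws:datasync:us-east-1:111122223333:location/loc-dst0123456789abc",
    Name="onprem-to-s3-daily",
    Schedule={"ScheduleExpression": "cron(0 2 * * ? *)"},  # daily incremental run
)

# A run can also be started on demand; scheduled runs copy only changed files.
datasync.start_task_execution(TaskArn=task["TaskArn"])
```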