A data engineer set up an AWS Lambda function to read an object that is stored in an Amazon S3 bucket. The object is encrypted with an AWS KMS key.
The data engineer configured the Lambda function's execution role to access the S3 bucket. However, the Lambda function encountered an error and failed to retrieve the content of the object.
What is the likely cause of the error?
d
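The usual cause here is a missing KMS permission rather than a missing S3 permission: reading an object encrypted with SSE-KMS triggers a transparent kms:Decrypt call, so the execution role needs kms:Decrypt on the key in addition to s3:GetObject. A minimal sketch of the read path (bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Reading an SSE-KMS encrypted object makes S3 call KMS to decrypt
    # the data key. If the execution role has only s3:GetObject, this
    # call fails with an access-denied error from KMS.
    response = s3.get_object(
        Bucket="example-bucket",       # hypothetical bucket name
        Key="data/transactions.json",  # hypothetical object key
    )
    return response["Body"].read().decode("utf-8")
```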
Two developers are working on separate application releases. The developers have created feature branches named Branch A and Branch B by using a GitHub repository's master branch as the source.
The developer for Branch A deployed code to the production system. The code for Branch B will merge into the master branch in the following week's scheduled application release.
Which command should the developer for Branch B run before the developer raises a pull request to the master branch?
c
A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.
The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.
The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.
Which solution will meet these requirements with the LEAST development effort?
b
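The lowest-effort pattern for near-real-time replication from an on-premises MySQL database into Redshift is AWS DMS with change data capture (CDC) over the existing Direct Connect link. A minimal boto3 sketch, assuming the replication instance and both endpoints already exist (all ARNs and the schema name are placeholders):

```python
import json
import boto3

dms = boto3.client("dms")

# Replicate every table in the PLM schema; refine the rules as needed.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-plm-schema",
        "object-locator": {"schema-name": "plm", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="plm-to-redshift",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",  # placeholder
    # full-load-and-cdc seeds the target, then streams ongoing changes
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```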
A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job.
The data engineer has set the maximum concurrency for the AWS Glue job to 1.
The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs.
What is the likely reason the AWS Glue job is reprocessing the files?
a
AWS Glue bookmarks track processed data to prevent reprocessing, but the job must commit the bookmark state at the end of each run.
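In practice the most common cause is a script that never calls job.commit(), so the bookmark state is never persisted. A minimal sketch of the commit pattern, assuming job bookmarks are enabled on the job (the S3 path is hypothetical):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # initializes bookmark state for this run

# transformation_ctx is what the bookmark keys on; without it,
# previously processed S3 files are read again on every run.
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},  # hypothetical
    format="json",
    transformation_ctx="read_s3_input",
)

# ... write to Amazon Redshift here ...

job.commit()  # persists the bookmark; omitting this causes reprocessing
```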
A company has implemented a lake house architecture in Amazon Redshift. The company needs to give users the ability to authenticate into Redshift query editor by using a third-party identity provider (IdP).
A data engineer must set up the authentication mechanism.
What is the first step the data engineer should take to meet this requirement?
a
A company plans to use Amazon Kinesis Data Firehose to store data in Amazon S3. The source data consists of 2 MB .csv files. The company must convert the .csv files to JSON format. The company must store the files in Apache Parquet format.
Which solution will meet these requirements with the LEAST development effort?
b
Firehose does not support direct .csv-to-JSON conversion, so additional processing is required; this points to D, because a Lambda function needs to convert the records first.
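For reference, a Firehose record-transformation Lambda that converts each incoming .csv record to JSON looks roughly like the sketch below (the column list is a hypothetical schema); Firehose's built-in record format conversion can then write the JSON records out as Parquet.

```python
import base64
import json

COLUMNS = ["transaction_id", "product_id", "amount"]  # hypothetical schema

def handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record base64-encoded.
        row = base64.b64decode(record["data"]).decode("utf-8").strip()
        as_json = json.dumps(dict(zip(COLUMNS, row.split(","))))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(as_json.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```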
A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance.
Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)
ab
Option A (HDFS): Requires persistent EMR clusters, increasing costs compared to S3.
Option C (x86-based instances): Less cost-efficient than Graviton instances for Spark workloads.
Option E (Spot Instances for primary nodes): Not recommended, because Spot Instances can be interrupted, which compromises reliability.
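A sketch of the cost-optimized layout with boto3, keeping the primary and core nodes on On-Demand Graviton instances and using Spot capacity only for interruption-tolerant task nodes (release label, instance types, and counts are illustrative):

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="spark-analysis",
    ReleaseLabel="emr-6.15.0",  # illustrative release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            # Primary and core nodes stay On-Demand for reliability.
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m7g.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m7g.xlarge", "InstanceCount": 2},
            # Task nodes tolerate interruption, so Spot cuts cost.
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m7g.xlarge", "InstanceCount": 4},
        ],
        # Long-running workload: keep the cluster alive between steps.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```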
A company is migrating a legacy application to an Amazon S3-based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information.
The data engineer must identify and remove duplicate information from the legacy application data.
Which solution will meet these requirements with the LEAST operational overhead?
d
Option D (AWS Glue + Python dedupe library): Still requires custom code, making it less efficient than FindMatches.
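For comparison, invoking a trained FindMatches ML transform inside a Glue job replaces custom dedupe code with a single apply call. A minimal sketch, assuming the transform has already been created and trained in AWS Glue (the S3 path and transform ID are placeholders):

```python
from awsglue.context import GlueContext
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Legacy application data read from the data lake (path is hypothetical).
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/legacy/"]},
    format="parquet",
)

# Rows that FindMatches judges to be duplicates share a match_id,
# which can then drive deduplication downstream.
matched = FindMatches.apply(
    frame=frame,
    transformId="tfm-0123456789abcdef",  # placeholder transform ID
)
```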
A data engineer finished testing an Amazon Redshift stored procedure that processes and inserts data into a table that is not mission critical. The engineer wants to automatically run the stored procedure on a daily basis.
Which solution will meet this requirement in the MOST cost-effective way?
b
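If the schedule lives outside Redshift, one low-cost pattern is a daily Amazon EventBridge schedule that invokes a small Lambda function calling the Redshift Data API, which needs no persistent connection or idle infrastructure. A sketch (cluster, database, user, and procedure names are hypothetical):

```python
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    # execute_statement runs asynchronously; since the table is not
    # mission critical, no retry or result polling is wired up here.
    redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",  # hypothetical cluster
        Database="analytics",                 # hypothetical database
        DbUser="etl_user",                    # hypothetical user
        Sql="CALL refresh_daily_table();",    # hypothetical procedure
    )
```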
A company stores details about transactions in an Amazon S3 bucket. The company wants to log all writes to the S3 bucket into another S3 bucket that is in the same AWS Region.
Which solution will meet this requirement with the LEAST operational effort?
a
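S3 server access logging is a one-time bucket-level setting that delivers a record of requests, including writes, to a target bucket in the same Region with essentially no ongoing effort. Enabling it with boto3 looks roughly like this (bucket names are placeholders, and the target bucket's policy must allow the S3 logging service to write):

```python
import boto3

s3 = boto3.client("s3")

# Deliver access logs for the source bucket into a target bucket
# in the same Region; no pipelines or triggers to maintain.
s3.put_bucket_logging(
    Bucket="transactions-bucket",  # placeholder source bucket
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "transactions-logs-bucket",  # placeholder
            "TargetPrefix": "access-logs/",
        }
    },
)
```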