Google Professional Data Engineer Practice Test

Professional Data Engineer on Google Cloud Platform


Question 1

You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of
their expected ranges, which indicates errors. You need to address this issue and be able to run the process again in the
future while keeping the original data for compliance reasons. What should you do?

  • A. Import the data from Cloud Storage into BigQuery. Create a new BigQuery table, and skip the rows with errors.
  • B. Create a Compute Engine instance and create a new copy of the data in Cloud Storage. Skip the rows with errors.
  • C. Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.
  • D. Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to the same dataset in Cloud Storage.
Answer:

C
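A minimal Apache Beam (Cloud Dataflow) sketch of the approach in option C, assuming newline-delimited JSON log records with a numeric "value" field; the bucket paths, field name, expected range, and default value are hypothetical:

    import json
    import apache_beam as beam

    EXPECTED_MIN, EXPECTED_MAX = 0.0, 100.0   # hypothetical valid range
    DEFAULT_VALUE = 0.0                       # hypothetical replacement value

    def fix_out_of_range(record):
        """Replace out-of-range values with a default; leave other fields untouched."""
        value = record.get("value")
        if value is None or not (EXPECTED_MIN <= value <= EXPECTED_MAX):
            record["value"] = DEFAULT_VALUE
        return record

    with beam.Pipeline() as pipeline:  # add DataflowRunner options to run on Cloud Dataflow
        (
            pipeline
            | "ReadLogs" >> beam.io.ReadFromText("gs://source-logs-bucket/logs/*.json")
            | "Parse" >> beam.Map(json.loads)
            | "FixRanges" >> beam.Map(fix_out_of_range)
            | "Serialize" >> beam.Map(json.dumps)
            # Writing to a new location keeps the original files intact for compliance,
            # and the pipeline can simply be re-run in the future.
            | "WriteCleaned" >> beam.io.WriteToText("gs://cleaned-logs-bucket/logs/cleaned")
        )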


Question 2

You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to
house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud
Dataflow pipeline for your project. How should you maintain users' privacy?

  • A. Grant the consultant the Viewer role on the project.
  • B. Grant the consultant the Cloud Dataflow Developer role on the project.
  • C. Create a service account and allow the consultant to log on with it.
  • D. Create an anonymized sample of the data for the consultant to work with in a different project.
Answer:

C


Question 3

You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps
that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and
running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must
be retried a fixed number of times. Which service should you use to manage the execution of these jobs?

  • A. Cloud Scheduler
  • B. Cloud Dataflow
  • C. Cloud Functions
  • D. Cloud Composer
Answer:

D
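A rough Cloud Composer (Apache Airflow) DAG sketch showing ordered, interdependent steps that mix a shell script, a Hadoop job on Dataproc, and a BigQuery query, each retried a fixed number of times; the project, cluster, bucket, script, and query contents are hypothetical:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    with DAG(
        dag_id="nightly_batch",
        schedule_interval="0 2 * * *",                      # run nightly at 02:00
        start_date=datetime(2024, 1, 1),
        catchup=False,
        default_args={"retries": 3,                         # fixed number of retries per step
                      "retry_delay": timedelta(minutes=10)},
    ) as dag:
        prepare = BashOperator(task_id="prepare", bash_command="bash /scripts/prepare_inputs.sh")

        hadoop_step = DataprocSubmitJobOperator(
            task_id="hadoop_step",
            project_id="my-project",                        # hypothetical
            region="us-central1",
            job={
                "placement": {"cluster_name": "batch-cluster"},
                "hadoop_job": {"main_jar_file_uri": "gs://my-bucket/jobs/transform.jar"},
            },
        )

        bq_step = BigQueryInsertJobOperator(
            task_id="bq_step",
            configuration={"query": {"query": "SELECT 1", "useLegacySql": False}},  # placeholder query
        )

        prepare >> hadoop_step >> bq_step                   # enforce the execution order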


Question 4

Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets
stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use
cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you
do first?

  • A. Use Google Stackdriver Audit Logs to review data access.
  • B. Get the Identity and Access Management (IAM) policy of each table.
  • C. Use Stackdriver Monitoring to see the usage of BigQuery query slots.
  • D. Use the Google Cloud Billing API to see what account the warehouse is being billed to.
Answer:

A
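Reviewing data access (option A) starts from the BigQuery Data Access audit logs. A small sketch using the google-cloud-logging client; the project ID and time window are hypothetical:

    from google.cloud import logging

    client = logging.Client(project="my-project")   # hypothetical project ID

    # Data Access audit log entries record who read or queried which BigQuery resources.
    log_filter = (
        'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Fdata_access" '
        'AND protoPayload.serviceName="bigquery.googleapis.com" '
        'AND timestamp>="2024-01-01T00:00:00Z"'
    )

    for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
        payload = entry.payload  # AuditLog payload as a dict
        print(entry.timestamp,
              payload.get("authenticationInfo", {}).get("principalEmail"),
              payload.get("methodName"))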


Question 5

You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support
transactions that scale horizontally. You also want to optimize data for range queries on non-key columns. What should you
do?

  • A. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
  • B. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
  • C. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
  • D. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.
Answer:

C
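The secondary index in option C can be added with a single DDL statement, here issued through the google-cloud-spanner client; the project, instance, database, table, and column names are hypothetical:

    from google.cloud import spanner

    client = spanner.Client(project="my-project")                 # hypothetical project
    database = client.instance("orders-instance").database("orders-db")

    # A secondary index lets Spanner answer range queries on a non-key column
    # (e.g. WHERE OrderDate BETWEEN @start AND @end) without scanning the whole table.
    operation = database.update_ddl(
        ["CREATE INDEX OrdersByOrderDate ON Orders(OrderDate)"]
    )
    operation.result(300)  # block until the schema change finishes (up to 5 minutes)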


Question 6

You have data stored in BigQuery. The data in the BigQuery dataset must be highly available. You need to define a
storage, backup, and recovery strategy for this data that minimizes cost. How should you configure the BigQuery table?

  • A. Set the BigQuery dataset to be regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
  • B. Set the BigQuery dataset to be regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
  • C. Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
  • D. Set the BigQuery dataset to be multi-regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
Answer:

B
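The scheduled-query backup mechanic described in options B and D can be set up through the BigQuery Data Transfer Service, writing each run to a table suffixed with the run date; the project, dataset, and table names are hypothetical:

    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    parent = client.common_project_path("my-project")            # hypothetical project

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="analytics_backup",
        display_name="Daily events backup",
        data_source_id="scheduled_query",
        params={
            "query": "SELECT * FROM `my-project.analytics.events`",
            # {run_date} suffixes each backup table with the date of that run.
            "destination_table_name_template": "events_backup_{run_date}",
            "write_disposition": "WRITE_TRUNCATE",
        },
        schedule="every 24 hours",
    )

    client.create_transfer_config(parent=parent, transfer_config=transfer_config)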


Question 7

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google
BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application
and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the
value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How
can you make that data available while minimizing cost?

  • A. Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.
  • B. Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.
  • C. Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.
  • D. Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.
Answer:

C
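A rough Apache Beam (Cloud Dataflow) sketch of option C: read the Users table, build FullName by concatenation, and load the result into a new table. The project, dataset, and table names are hypothetical, and a temp/staging location is also needed when running this on Dataflow:

    import apache_beam as beam
    from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery

    def add_full_name(row):
        """FullName = FirstName + ' ' + LastName."""
        row["FullName"] = f"{row['FirstName']} {row['LastName']}"
        return row

    with beam.Pipeline() as pipeline:  # add DataflowRunner options to run on Cloud Dataflow
        (
            pipeline
            | "ReadUsers" >> ReadFromBigQuery(
                query="SELECT FirstName, LastName FROM `my-project.hr.Users`",
                use_standard_sql=True,
            )
            | "AddFullName" >> beam.Map(add_full_name)
            | "WriteUsers" >> WriteToBigQuery(
                "my-project:hr.UsersWithFullName",
                schema="FirstName:STRING,LastName:STRING,FullName:STRING",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )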


Question 8

You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to
perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools
in other cloud providers. What should you do?

  • A. Store and process the entire dataset in BigQuery.
  • B. Store and process the entire dataset in Cloud Bigtable.
  • C. Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.
  • D. Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.
Answer:

C
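The "expose the dataset as files" requirement can be met with a BigQuery extract job that writes compressed, sharded files to Cloud Storage for other clouds' batch tools to read; the project, table, and bucket names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")                # hypothetical project

    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.AVRO,       # self-describing, splittable files
        compression=bigquery.Compression.SNAPPY,
    )

    # The wildcard lets BigQuery shard a very large table across many output files.
    extract_job = client.extract_table(
        "my-project.analytics.events",
        "gs://analytics-export-bucket/events/part-*.avro",
        job_config=job_config,
    )
    extract_job.result()  # wait for the export to finish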


Question 9

You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate
small data centers around the world to capture these events, but leased lines that provide connectivity from your event
collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to
address this issue in the most cost-effective way. What should you do?

  • A. Deploy small Kafka clusters in your data centers to buffer events.
  • B. Have the data acquisition devices publish data to Cloud Pub/Sub.
  • C. Establish a Cloud Interconnect between all remote data centers and Google.
  • D. Write a Cloud Dataflow pipeline that aggregates all data in session windows.
Answer:

A
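If events are buffered in small local Kafka clusters (option A), the collectors write to a broker inside the same data center, which absorbs outages on the unreliable leased lines; a minimal producer sketch using the kafka-python package, with hypothetical broker and topic names:

    import json
    from kafka import KafkaProducer

    # The broker runs inside the local data center, so events are durably buffered
    # even while the leased line to the processing site is down or slow.
    producer = KafkaProducer(
        bootstrap_servers=["kafka.dc-local:9092"],                 # hypothetical local broker
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
        acks="all",    # wait for local replication before acknowledging the sensor
        retries=5,
    )

    producer.send("vehicle-events", {"vehicle_id": 42, "speed_kmh": 87.5})
    producer.flush()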


Question 10

Your company needs to upload their historical data to Cloud Storage. The security rules don't allow access from external IPs
to their on-premises resources. After an initial upload, they will add new data from existing on-premises applications every
day. What should they do?

  • A. Execute gsutil rsync from the on-premises servers.
  • B. Use Cloud Dataflow and write the data to Cloud Storage.
  • C. Write a job template in Cloud Dataproc to perform the data transfer.
  • D. Install an FTP server on a Compute Engine VM to receive the files and move them to Cloud Storage.
Answer:

A
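A daily gsutil rsync initiated from the on-premises servers (option A) only needs outbound connectivity and uploads just the new or changed files each day; a thin wrapper sketch with hypothetical paths, suitable for a daily cron job:

    import subprocess

    LOCAL_DIR = "/data/exports"                    # hypothetical on-premises directory
    BUCKET_PATH = "gs://company-historical-data"   # hypothetical Cloud Storage bucket

    # -m parallelizes the transfer; rsync -r only uploads files that are new or changed,
    # which covers both the initial bulk upload and the daily incremental additions.
    subprocess.run(
        ["gsutil", "-m", "rsync", "-r", LOCAL_DIR, BUCKET_PATH],
        check=True,
    )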
