Databricks Certified Data Engineer Professional Practice Test


Last exam update: May 17, 2024
Page 1 out of 11. Viewing questions 1-10 out of 110

Question 1

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM.
Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

  • A. Total VMs: 1; 400 GB per Executor; 160 Cores/Executor
  • B. Total VMs: 8; 50 GB per Executor; 20 Cores/Executor
  • C. Total VMs: 16; 25 GB per Executor; 10 Cores/Executor
  • D. Total VMs: 4; 100 GB per Executor; 40 Cores/Executor
  • E. Total VMs: 2; 200 GB per Executor; 80 Cores/Executor
Answer: B
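The premise that every option offers the same total resources can be checked with a few lines of arithmetic. The sketch below is only an illustration; the `configs` layout and the `totals` helper are hypothetical names, not part of the question.

```python
# (number of VMs, GB of RAM per executor, cores per executor); one executor per VM
configs = {
    "A": (1, 400, 160),
    "B": (8, 50, 20),
    "C": (16, 25, 10),
    "D": (4, 100, 40),
    "E": (2, 200, 80),
}


def totals(vms, gb_per_executor, cores_per_executor):
    """Return (total GB of RAM, total cores) for a cluster with one executor per VM."""
    return vms * gb_per_executor, vms * cores_per_executor


for label, cfg in configs.items():
    print(label, totals(*cfg))
```

The options differ only in how the same resources are split across VMs, which changes how much shuffle data from a wide transformation has to cross the network and how much work is lost if a single VM fails.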


Question 2

You are testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

assert(myIntegrate(lambda x: x*x, 0, 3)[0] == 9)

Which kind of test would the above line exemplify?

  • A. Unit
  • B. Manual
  • C. Functional
  • D. Integration
  • E. End-to-end
Answer: A
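The question does not show how myIntegrate is implemented, so the following is a hypothetical stand-in that makes the assertion concrete. It assumes the function returns a tuple whose first element is the estimate (suggested by the `[0]` index) and uses composite Simpson's rule, which is exact for quadratics up to floating-point rounding. Because of that rounding, the check compares with a tolerance instead of `==`.

```python
import math


def myIntegrate(f, a, b, n=100):
    """Hypothetical stand-in: composite Simpson's rule on [a, b].

    Returns (estimate, n) so that indexing with [0] yields the estimate,
    mirroring the shape of the call in the question.
    """
    if n % 2:  # Simpson's rule needs an even number of subintervals
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weights 4 (odd) and 2 (even)
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3, n


# The test exercises one function in isolation with a known input and output.
assert math.isclose(myIntegrate(lambda x: x * x, 0, 3)[0], 9.0, rel_tol=1e-9)
```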


Question 3

Which of the following is true of Delta Lake and the Lakehouse?

  • A. Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
  • B. Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.
  • C. Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.
  • D. Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.
  • E. Z-order can only be applied to numeric values stored in Delta Lake tables.
Answer: B


Question 4

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?

  • A. Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9.
  • B. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
  • C. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch
  • D. Merge all changes back to the main branch in the remote Git repository and clone the repo again
  • E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository
Answer: B


Question 5

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:



An analyst who is not a member of the auditing group executes the following query:

SELECT * FROM user_ltv_no_minors

Which statement describes the results returned by this query?

  • A. All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.
  • B. All age values less than 18 will be returned as null values, all other columns will be returned with the values in user_ltv.
  • C. All values for the age column will be returned as null values, all other columns will be returned with the values in user_ltv.
  • D. All records from all columns will be displayed with the values in user_ltv.
  • E. All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.
Answer: A


Question 6

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

  • A. The Five Minute Load Average remains consistent/flat
  • B. Bytes Received never exceeds 80 million bytes per second
  • C. Network I/O never spikes
  • D. Total Disk Space remains constant
  • E. CPU Utilization is around 75%
Answer: E


Question 7

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both DEEP and SHALLOW CLONE, development tables are created using SHALLOW CLONE.

A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that VACUUM was run the day before.

Which statement describes why the cloned tables are no longer working?

  • A. Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.
  • B. Running VACUUM automatically invalidates any shallow clones of a table; DEEP CLONE should always be used when a cloned table will be repeatedly queried.
  • C. Tables created with SHALLOW CLONE are automatically deleted after their default retention threshold of 7 days.
  • D. The metadata created by the CLONE operation is referencing data files that were purged as invalid by the VACUUM command.
  • E. The data files compacted by VACUUM are not tracked by the cloned metadata; running REFRESH on the cloned table will pull in recent changes.
Answer: D
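The interaction between a shallow clone and VACUUM described in this question can be pictured with a toy model. The sketch below is plain Python, not Delta Lake code: tables are modelled as sets of file names, all names are made up for illustration, and the retention threshold that a real VACUUM honours is ignored.

```python
# Files that physically exist in storage.
storage = {"part-0.parquet", "part-1.parquet"}

# The source table's metadata lists the files that make up its current version.
source_files = {"part-0.parquet", "part-1.parquet"}

# A shallow clone copies only metadata, so it points at the same files.
clone_files = set(source_files)

# A Type 1 update on the source rewrites one file: the new version of the
# source references part-2 instead of part-1.
storage.add("part-2.parquet")
source_files = {"part-0.parquet", "part-2.parquet"}

# A later VACUUM on the source removes files the source no longer references.
storage = storage & source_files

# The clone's metadata still lists a file that is gone.
missing = clone_files - storage
print(missing)
```

In this model the clone ends up pointing at `part-1.parquet`, which no longer exists in storage, so reads against the clone fail.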


Question 8

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

  • A. "Can Manage" privileges on the required cluster
  • B. Workspace Admin privileges, cluster creation allowed, "Can Attach To" privileges on the required cluster
  • C. Cluster creation allowed, "Can Attach To" privileges on the required cluster
  • D. "Can Restart" privileges on the required cluster
  • E. Cluster creation allowed, "Can Restart" privileges on the required cluster
Answer: D


Question 9

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

  • A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
  • B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
  • C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
  • D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
  • E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
Answer: E


Question 10

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:



Choose the response that correctly fills in the blank within the code block to complete this task.

  • A. withWatermark("event_time", "10 minutes")
  • B. awaitArrival("event_time", "10 minutes")
  • C. await("event_time + 10 minutes'")
  • D. slidingWindow("event_time", "10 minutes")
  • E. delayWrite("event_time", "10 minutes")
Answer: A
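The two durations in this question play different roles: five minutes is the size of each non-overlapping window, and ten minutes bounds how late an event may arrive and still update its window. The plain-Python sketch below mimics that bookkeeping for illustration only. It is not Spark code, `window_start` and `is_too_late` are hypothetical helper names, and the lateness rule is simplified to "more than 10 minutes behind the latest event time seen so far".

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)     # size of each non-overlapping window
LATENESS = timedelta(minutes=10)  # how long state is kept for late data
EPOCH = datetime(1970, 1, 1)


def window_start(event_time):
    """Start of the five-minute window that contains event_time."""
    return event_time - (event_time - EPOCH) % WINDOW


def is_too_late(event_time, max_event_time_seen):
    """True if the event falls behind the simplified watermark."""
    return event_time < max_event_time_seen - LATENESS


latest = datetime(2024, 1, 1, 12, 11)
for ts in (datetime(2024, 1, 1, 12, 7, 30), datetime(2024, 1, 1, 12, 0)):
    print(ts, window_start(ts), is_too_late(ts, latest))
```

In PySpark the corresponding pieces are the `withWatermark` method on a streaming DataFrame and the `window` function from `pyspark.sql.functions` used inside `groupBy`.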
