NVIDIA NCP-AIO Practice Test

AI Operations

Last exam update: Nov 18, 2025
Page 1 of 5. Viewing questions 1-15 of 66

Question 1

A Slurm user needs to submit a batch job script for execution tomorrow.
Which command should be used to complete this task?

  • A. sbatch --begin=tomorrow
  • B. submit --begin=tomorrow
  • C. salloc --begin=tomorrow
  • D. srun --begin=tomorrow
Answer: A


Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
In Slurm cluster administration, the command to submit a batch job script is sbatch. It hands the
script to the Slurm controller for scheduling rather than running it immediately. The option
--begin=tomorrow (short form -b) defers the job's eligibility, so it will not start before the
requested time.
The other commands have different purposes:
submit is not a Slurm command.
salloc allocates resources for an interactive session; it does not submit a batch script for
deferred execution.
srun launches tasks immediately on allocated resources, typically inside an existing allocation or
interactively, and is not used for batch job submission.
Therefore, the correct command to submit a batch job script for future execution is sbatch
--begin=tomorrow.
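As a minimal sketch (the file name, job name, and script contents are illustrative, not from the exam), the full submission flow looks like this:

```shell
# Create a minimal batch script (name and contents are hypothetical)
cat > train.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
srun python train.py
EOF

# Submit now; the job becomes eligible to start no earlier than tomorrow
sbatch --begin=tomorrow train.sbatch
```

Slurm also accepts other time specifications for --begin, such as now+1hour, 16:00, or 2025-12-01T08:00:00.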


Question 2

You are configuring networking for a new AI cluster in your data center. The cluster will handle large-
scale distributed training jobs that require fast communication between servers.
What type of networking architecture can maximize performance for these AI workloads?

  • A. Implement a leaf-spine network topology using standard Ethernet switches to ensure scalability as more nodes are added.
  • B. Prioritize out-of-band management networks over compute networks to ensure efficient job scheduling across nodes.
  • C. Use standard Ethernet networking with a focus on increasing bandwidth through multiple connections per server.
  • D. Use InfiniBand networking to provide low-latency, high-throughput communication between servers in the cluster.
Answer: D


Explanation:
For large-scale AI workloads such as distributed training of large language models, the networking
infrastructure must deliver extremely low latency and very high throughput to keep GPUs and
compute nodes efficiently synchronized. NVIDIA highlights that InfiniBand networking is essential in
AI data centers because it provides ultra-low latency, high bandwidth, adaptive routing, congestion
control, and noise isolation—features critical for high-performance AI training clusters.
InfiniBand acts not just as a network but as a computing fabric, integrating compute and
communication tightly. Microsoft Azure, a leading cloud provider, uses thousands of miles of
InfiniBand cabling to meet the demands of their AI workloads, demonstrating its importance. While
Ethernet-based solutions like NVIDIA’s Spectrum-X are emerging and optimized for AI, InfiniBand
remains the premier choice for AI supercomputing networks.
Therefore, for maximizing performance in a new AI cluster focused on distributed training, InfiniBand
networking (option D) is the recommended architecture. Other Ethernet-based approaches provide
scalability and bandwidth but cannot match InfiniBand’s specialized low-latency and high-throughput
performance for AI.


Question 3

A system administrator needs to optimize the delivery of their AI applications to the edge.
What NVIDIA platform should be used?

  • A. Base Command Platform
  • B. Base Command Manager
  • C. Fleet Command
  • D. NetQ
Answer: C


Explanation:
NVIDIA Fleet Command is the platform designed specifically to optimize and manage the
deployment and delivery of AI applications at the edge. It enables secure and scalable orchestration
of AI workloads across distributed edge devices, providing lifecycle management, remote
monitoring, and updates. Fleet Command facilitates running AI applications closer to where data is
generated (edge), improving latency and operational efficiency.
Base Command Platform and Base Command Manager primarily target data center and AI cluster
management for configuration, monitoring, and troubleshooting.
NetQ is focused on network telemetry and network state monitoring rather than application
delivery.
Therefore, for AI application delivery and optimization at the edge, Fleet Command is the
recommended NVIDIA platform.


Question 4

A Slurm user is experiencing a frequent issue where a Slurm job is getting stuck in the “PENDING”
state and unable to progress to the “RUNNING” state.
Which Slurm command can help the user identify the reason for the job’s pending status?

  • A. sinfo -R
  • B. scontrol show job <jobid>
  • C. sacct -j <job[.step]>
  • D. squeue -u <user_list>
Answer: B


Explanation:
The Slurm command scontrol show job <jobid> provides detailed information about a specific job,
including its current status and, crucially, the reason why a job might be pending. This command
shows job details such as resource requirements, dependencies, and any issues blocking the job from
running.
sinfo -R displays information about nodes and their reasons for being in various states but does not
provide job-specific reasons.
sacct -j shows accounting data for jobs but typically does not explain pending causes.
squeue -u lists jobs by user but does not detail the pending reasons.
Hence, scontrol show job <jobid> is the appropriate command to diagnose why a Slurm job remains
in the pending state.
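As a sketch (the job ID 12345 is a placeholder), the Reason= field in the scontrol output names the blocker directly:

```shell
# Full record for one job; look for JobState=PENDING and the Reason= field
# (common values include Resources, Priority, Dependency, and QOS/account limits)
scontrol show job 12345

# Extract just the state and reason fields
scontrol show job 12345 | grep -Eo '(JobState|Reason)=[^ ]+'
```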


Question 5

You are a Solutions Architect designing a data center infrastructure for a cloud-based AI application
that requires high-performance networking, storage, and security. You need to choose a software
framework to program the NVIDIA BlueField DPUs that will be used in the infrastructure. The
framework must support the development of custom applications and services, as well as enable
tailored solutions for specific workloads. Additionally, the framework should allow for the integration
of storage services such as NVMe over Fabrics (NVMe-oF) and elastic block storage.
Which framework should you choose?

  • A. NVIDIA TensorRT
  • B. NVIDIA CUDA
  • C. NVIDIA Nsight
  • D. NVIDIA DOCA
Answer: D


Explanation:
NVIDIA DOCA (Data Center-on-a-Chip Architecture) is the software framework for programming
NVIDIA BlueField DPUs (Data Processing Units). DOCA provides libraries, APIs, and tools to develop
custom applications, enabling users to offload, accelerate, and secure data center infrastructure
functions on BlueField DPUs.
DOCA supports integration with key data center services including storage protocols such as NVMe
over Fabrics (NVMe-oF), elastic block storage, and network security and telemetry. It enables tailored
solutions optimized for specific workloads and high-performance infrastructure demands.
TensorRT is focused on AI inference optimization.
CUDA is NVIDIA’s GPU programming model for general-purpose GPU computing, not for DPUs.
Nsight is a suite of development tools for debugging and profiling NVIDIA GPUs.
Therefore, NVIDIA DOCA is the correct framework for programming BlueField DPUs in a data center
environment requiring custom application development and advanced storage/networking
integration.


Question 6

You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of
GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as
display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?

  • A. Verify that the GPUs are correctly listed in both gres.conf and slurm.conf, and ensure that unconfigured GPUs are excluded.
  • B. Use nvidia-smi to manually assign GPUs to each job before submission.
  • C. Reinstall the NVIDIA drivers to ensure proper GPU detection by Slurm.
  • D. Increase the number of GPUs requested in the job script to avoid using unconfigured GPUs.
Answer: A


Explanation:
In Slurm GPU resource management, the gres.conf file defines the available GPUs (generic resources)
per node, while slurm.conf configures the cluster-wide GPU scheduling policies. To prevent jobs from
using GPUs reserved for other purposes (e.g., display rendering GPUs), administrators must ensure
that only the GPUs intended for compute workloads are listed in these configuration files.
Properly configuring gres.conf allows Slurm to recognize and expose only those GPUs meant for jobs.
slurm.conf must be aligned to exclude or restrict unconfigured GPUs.
Manual GPU assignment using nvidia-smi is not scalable or integrated with Slurm scheduling.
Reinstalling drivers or increasing GPU requests does not solve resource exclusion.
Thus, the correct approach is to verify and configure GPU listings accurately in gres.conf and
slurm.conf to restrict job allocations to intended GPUs.
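As a hedged sketch: assume a hypothetical node gpu01 with four A100 compute GPUs at /dev/nvidia0-3 and a display GPU at /dev/nvidia4 (node name, device paths, and counts are all assumptions for illustration). The two files would then agree on exposing only the four compute GPUs:

```
# /etc/slurm/gres.conf on node gpu01 — expose only the four compute GPUs;
# the display GPU at /dev/nvidia4 is deliberately left out
NodeName=gpu01 Name=gpu Type=a100 File=/dev/nvidia[0-3]

# /etc/slurm/slurm.conf — the node definition must match (4 GPUs, not 5)
GresTypes=gpu
NodeName=gpu01 Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
```

After changing either file, restart the Slurm daemons so the new device list takes effect.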


Question 7

A data scientist is training a deep learning model and notices slower than expected training times.
The data scientist alerts a system administrator to inspect the issue. The system administrator
suspects the disk IO is the issue.
What command should be used?

  • A. tcpdump
  • B. iostat
  • C. nvidia-smi
  • D. htop
Answer: B


Explanation:
To diagnose disk IO performance issues, the system administrator should use the iostat command,
which reports CPU statistics and input/output statistics for devices and partitions. It helps identify
bottlenecks in disk throughput or latency affecting application performance.
tcpdump is used for network traffic analysis, not disk IO.
nvidia-smi monitors NVIDIA GPU status but not disk IO.
htop shows CPU, memory, and process usage but provides limited disk IO details.
Therefore, iostat is the appropriate tool to assess disk IO performance and diagnose bottlenecks
impacting training times.
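iostat ships with the sysstat package on most Linux distributions. A typical invocation for this kind of diagnosis:

```shell
# Extended per-device statistics in MB, refreshed every 2 s for 5 samples;
# the first report is an average since boot, so focus on the later ones
iostat -dxm 2 5
```

Columns worth watching are %util (device saturation), await (average I/O latency in ms), and r/s plus w/s (IOPS). Sustained %util near 100 with rising await during data loading suggests storage is the bottleneck.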


Question 8

You have noticed that users can access all GPUs on a node even when they request only one GPU in
their job script using --gres=gpu:1. This is causing resource contention and inefficient GPU usage.
What configuration change would you make to restrict users’ access to only their allocated GPUs?

  • A. Increase the memory allocation per job to limit access to other resources on the node.
  • B. Enable cgroup enforcement in cgroup.conf by setting ConstrainDevices=yes.
  • C. Set a higher priority for jobs requesting fewer GPUs, so they finish faster and free up resources sooner.
  • D. Modify the job script to include additional resource requests for CPU cores alongside GPUs.
Answer: B


Explanation:
To restrict users’ access strictly to the GPUs allocated to their jobs, Slurm uses cgroups (control
groups) for resource isolation. Enabling device cgroup enforcement by setting ConstrainDevices=yes
in cgroup.conf enforces device access restrictions, ensuring jobs cannot access GPUs beyond those
assigned.
Increasing memory allocation or setting job priorities does not restrict device access.
Modifying job scripts to request additional CPU cores does not limit GPU access.
Hence, enabling cgroup enforcement with ConstrainDevices=yes is the correct method to prevent
users from accessing unallocated GPUs.
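A minimal configuration sketch follows. ConstrainDevices is the setting the question asks about; the other lines are companion settings commonly used with cgroup enforcement, shown as assumptions rather than requirements of any particular site:

```
# /etc/slurm/cgroup.conf
ConstrainDevices=yes    # jobs see only the devices (GPUs) they were allocated
ConstrainCores=yes      # optional: confine jobs to their allocated CPU cores
ConstrainRAMSpace=yes   # optional: confine jobs to their allocated memory

# /etc/slurm/slurm.conf must also select the cgroup plugins, e.g.:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
```

After restarting the daemons, a job submitted with --gres=gpu:1 should see exactly one device in nvidia-smi.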


Question 9

A new researcher needs access to GPU resources but should not have permission to modify cluster
settings or manage other users.
What role should you assign them in Run:ai?

  • A. L1 Researcher
  • B. Department Administrator
  • C. Application Administrator
  • D. Research Manager
Answer: A


Explanation:
In Run:ai, roles are assigned based on levels of permissions. The L1 Researcher role is designed for
users who need access to GPU resources for running jobs and experiments but should not have
administrative rights over cluster settings or other users. This role ensures researchers can use
resources without affecting cluster configurations or user management. Other roles like Department
Administrator, Application Administrator, or Research Manager have broader privileges, including
managing users and settings, which are not appropriate for the new researcher’s requirements.


Question 10

When troubleshooting Slurm job scheduling issues, a common source of problems is jobs getting
stuck in a pending state indefinitely.
Which Slurm command can be used to view detailed information about all pending jobs and identify
the cause of the delay?

  • A. scontrol
  • B. sacct
  • C. sinfo
Answer: A


Explanation:
The Slurm command scontrol provides detailed job control and information capabilities. Using
scontrol (e.g., scontrol show job <jobid>) can reveal comprehensive details about jobs, including
pending jobs, and the specific reasons why they are delayed or blocked. It is the go-to command for
in-depth troubleshooting of job states. While sacct provides accounting information and sinfo
displays node and partition status, neither provides as detailed or actionable information on pending
job causes as scontrol.
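To survey the whole queue first, squeue's %R format field prints the pending reason for every job, after which scontrol gives the full record (the job ID 12345 is a placeholder):

```shell
# Queue-wide view: one line per pending job with its reason in the last column
squeue --state=PENDING --format="%.10i %.9P %.20j %.8u %R"

# Full detail for a specific job, including the Reason= field
scontrol show job 12345
```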


Question 11

What must be done before installing new versions of DOCA drivers on a BlueField DPU?

  • A. Uninstall any previous versions of DOCA drivers.
  • B. Re-flash the firmware every time.
  • C. Disable network interfaces during installation.
  • D. Reboot the host system.
Answer: A


Explanation:
Before installing new versions of DOCA drivers on NVIDIA BlueField DPUs, it is required to uninstall
any previous versions of DOCA drivers to prevent conflicts and ensure a clean upgrade. This ensures
that the new installation is not affected by leftover files or configurations from earlier versions. Re-
flashing firmware or disabling network interfaces is not always required before every driver
installation. Rebooting the host system might be recommended after installation but is not a
prerequisite before installing drivers.


Question 12

A Slurm user needs to display real-time information about the running processes and resource usage
of a Slurm job.
Which command should be used?

  • A. smap -j <jobid>
  • B. scontrol show job <jobid>
  • C. sstat -j <job(.step)>
  • D. sinfo -j <jobid>
Answer: C


Explanation:
The Slurm command sstat is designed to provide real-time statistics about running jobs, including
process-level resource usage such as CPU time, memory (RSS), and disk I/O. Using sstat -j
<jobid> or sstat -j <jobid.step> allows monitoring of an active job's resource consumption.
smap is not a standard Slurm command.
scontrol show job gives job configuration and status but not real-time resource usage.
sinfo displays node and partition information, not job-specific resource stats.
Therefore, sstat is the correct command for real-time job process and resource monitoring.
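As a usage sketch (the job ID 12345 is a placeholder): for batch jobs, the script itself runs as the .batch step, so that is usually the step to query.

```shell
# Live statistics for a running step; works only while the step is active
sstat -j 12345.batch --format=JobID,AveCPU,AveRSS,MaxRSS,AveDiskRead,AveDiskWrite
```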


Question 13

Which two (2) ways does the pre-configured GPU Operator in NVIDIA Enterprise Catalog differ from
the GPU Operator in the public NGC catalog? (Choose two.)

  • A. It is configured to use a prebuilt vGPU driver image.
  • B. It supports Mixed Strategies for Kubernetes deployments.
  • C. It automatically installs the NVIDIA Datacenter driver.
  • D. It is configured to use the NVIDIA License System (NLS).
  • E. It additionally installs Network Operator.
Answer: A, D


Explanation:
The pre-configured GPU Operator in the NVIDIA Enterprise Catalog differs from the public NGC
catalog GPU Operator primarily by its configuration to use a prebuilt vGPU driver image and being
configured to use the NVIDIA License System (NLS). These adaptations allow better support for
enterprise environments where vGPU functionality and license management are critical.
Other options such as automatic installation of the Datacenter driver or additional installation of
Network Operator are not specific differences highlighted between the two operators.


Question 14

You are managing multiple edge AI deployments using NVIDIA Fleet Command. You need to ensure
that each AI application running on the same GPU is isolated from others to prevent interference.
Which feature of Fleet Command should you use to achieve this?

  • A. Remote Console
  • B. Secure NFS support
  • C. Multi-Instance GPU (MIG) support
  • D. Over-the-air updates
Answer: C


Explanation:
NVIDIA Fleet Command is a cloud-native software platform designed to deploy, manage, and
orchestrate AI applications at the edge. When managing multiple AI applications on the same GPU,
Multi-Instance GPU (MIG) support is critical. MIG allows a single GPU to be partitioned into multiple
independent instances, each with dedicated resources (compute, memory, bandwidth), enabling
workload isolation and preventing interference between applications.
Remote Console allows remote access for management but does not provide GPU resource isolation.
Secure NFS support is for secure network file system sharing, unrelated to GPU resource partitioning.
Over-the-air updates are for updating software remotely, not for GPU resource management.
Therefore, to ensure application isolation on the same GPU in Fleet Command environments,
enabling MIG support (option C) is the recommended and standard practice.
This capability is emphasized in NVIDIA’s AI Operations and Fleet Command documentation for
managing edge AI deployments efficiently and securely.
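Fleet Command exposes MIG configuration for the systems it manages; underneath, partitioning a MIG-capable GPU corresponds to an nvidia-smi sequence like the following (the profile IDs are examples for an A100 40GB and vary by hardware):

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset or reboot to take effect)
sudo nvidia-smi -i 0 -mig 1

# List the GPU-instance profiles this hardware supports
sudo nvidia-smi mig -lgip

# Create two isolated 3g.20gb instances and their compute instances
sudo nvidia-smi mig -cgi 9,9 -C

# Verify the resulting MIG devices
nvidia-smi -L
```

Each instance then appears as an independent GPU to the application scheduled on it.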


Question 15

You are deploying AI applications at the edge and want to ensure they continue running even if one
of the servers at an edge location fails.
How can you configure NVIDIA Fleet Command to achieve this?

  • A. Use Secure NFS support for data redundancy.
  • B. Set up over-the-air updates to automatically restart failed applications.
  • C. Enable high availability for edge clusters.
  • D. Configure Fleet Command's multi-instance GPU (MIG) to handle failover.
Answer: C


Explanation:
To ensure continued operation of AI applications at the edge despite server failures, NVIDIA Fleet
Command allows administrators to enable high availability (HA) for edge clusters. This HA
configuration ensures redundancy and failover capabilities, so applications remain operational when
an edge server goes down.
Over-the-air updates handle software patching but do not inherently provide failover. MIG manages
GPU resource partitioning, not failover. Secure NFS supports storage redundancy but is not the
primary solution for application failover.
