Getting started on the NVIDIA GB200 NVL72

Created by Cesar Arias, Modified on Fri, 22 May at 2:24 AM by Cesar Arias

Getting started on the NVIDIA GB200 NVL72 4-rack cluster

This guide combines NVL72 platform context, access instructions, topology-aware scheduling guidance, workload selection advice, and practical Slurm job examples for both non-containerized and Pyxis+Enroot containerized workflows on the beta partition.

Environment overview

The GB200 NVL72 SuperPOD is a multi-rack cluster built from Grace-based nodes with tightly integrated B200 GPUs and a high-bandwidth NVLink/NVSwitch fabric, exposed to users through the beta partition for large-scale AI and HPC workloads.

  • Designed for very large, multi-node jobs that need strong scaling across many GPUs.
  • Often paired with a dedicated software stack and container images tuned for GB200/B200 systems.

Recommended usage: Use smaller environments for early development, then promote stable workflows to beta once they are validated and ready to scale.


4-rack cluster totals

The table below gives a quick system-level view of the NVL72 deployment and helps explain why placement, scale, and scheduling policy matter on this platform.

Resource               | Per Rack         | 4-Rack Cluster Total
B200 GPUs              | 72               | 288 GPUs
Grace CPUs             | 36               | 144 CPUs (10,368 ARM cores)
Compute nodes (trays)  | 18               | 72 nodes
GPU memory (HBM3e)     | 13.4 TB          | ~53.6 TB
NVLink domains         | 1                | 4 independent rack domains
Scale-out NICs         | 72 x ConnectX-8  | 800 Gb/s each; 57.6 Tb/s per rack
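
As a quick sanity check, the totals in the table can be reproduced from the per-rack figures with plain shell arithmetic. This is only a sketch: the variable names are illustrative, and the figure of 72 cores per Grace CPU is an assumption inferred from 10,368 cores across 144 CPUs.

```shell
# Per-rack figures copied from the table above.
racks=4
gpus_per_rack=72
cpus_per_rack=36
nodes_per_rack=18
nics_per_rack=72
nic_gbps=800
# Assumption: 72 Arm cores per Grace CPU, which is what 10,368 / 144 implies.
cores_per_cpu=72

total_gpus=$(( racks * gpus_per_rack ))
total_cpus=$(( racks * cpus_per_rack ))
total_cores=$(( total_cpus * cores_per_cpu ))
total_nodes=$(( racks * nodes_per_rack ))
gpus_per_node=$(( gpus_per_rack / nodes_per_rack ))
rack_nic_gbps=$(( nics_per_rack * nic_gbps ))

echo "GPUs: ${total_gpus} total, ${gpus_per_node} per node"
echo "Grace CPUs: ${total_cpus} (${total_cores} cores)"
echo "Compute nodes: ${total_nodes}"
# Integer arithmetic only: split Gb/s into whole and tenths of Tb/s.
echo "Scale-out bandwidth per rack: $(( rack_nic_gbps / 1000 )).$(( rack_nic_gbps % 1000 / 100 )) Tb/s"
```

The per-node result of 4 GPUs is the reason the job examples in this guide request --gpus-per-node=4.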

Segment-based topology-aware scheduling

On NVL72 systems configured with Slurm block scheduling, --segment is the user-facing way to express how many nodes should stay grouped together within the topology plan. This gives users a way to trade off locality against scheduling flexibility.

  • Smaller segments: more flexible scheduling, often faster starts, but less strict locality.
  • Larger segments: stronger locality for communication heavy jobs, but potentially longer queue times.
  • Important: --segment does not create topology by itself; it relies on the cluster being configured with Slurm block topology by the administrators.

Segment size  | Typical meaning                      | Recommended use case
--segment=1   | Maximum placement flexibility        | Independent work, smoke tests, loosely coupled jobs
--segment=4   | Keep nodes grouped in sets of 4      | Small distributed scale-out tests
--segment=8   | Larger grouped placement             | Medium distributed training jobs
--segment=16  | Strong locality across large groups  | Communication-heavy larger jobs

# Example: 4-node job with 4-node segment grouping
sbatch --partition=beta --nodes=4 --gpus-per-node=4 --segment=4 my_job.sh

# Example: 8-node job with 8-node segment grouping
sbatch --partition=beta --nodes=8 --gpus-per-node=4 --segment=8 train.sh
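
When the node count is not one of the sizes in the table, one possible heuristic is to pick the largest listed segment size that divides the node count evenly. The helper below is a hypothetical sketch, not site policy or Slurm behavior: suggest_segment and its optional cap argument are names invented for this example, and it assumes you want the allocation to split into equal segments.

```shell
# suggest_segment NODES [MAX_SEGMENT]
# Prints the largest size from the table (16, 8, 4, 1) that does not exceed
# MAX_SEGMENT or NODES and divides NODES evenly.
# The function body runs in a subshell so its variables do not leak.
suggest_segment() (
  nodes=$1
  max_segment=${2:-16}
  for size in 16 8 4 1; do
    if [ "$size" -le "$max_segment" ] && [ "$size" -le "$nodes" ] \
       && [ $(( nodes % size )) -eq 0 ]; then
      echo "$size"
      return 0
    fi
  done
  return 1
)

# Example: a 12-node job falls back to 4-node segments.
echo "sbatch --partition=beta --nodes=12 --gpus-per-node=4 --segment=$(suggest_segment 12) train.sh"
```

Treat the result as a starting point and compare it against the site default before relying on it.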

Workload use-case matrix

Use this matrix to map the shape of the workload to a recommended launch pattern on beta.

Workload pattern                      | Typical size                 | Suggested execution style
Environment smoke test                | 1 node, 1-4 GPUs             | Bare metal or container
Single node fine-tuning               | 1 node, 4 GPUs               | Prefer container
Data parallel scale-out test          | 2-4 nodes, 4 GPUs per node   | Container strongly recommended
Tensor or pipeline parallel training  | 4-16 nodes                   | Container strongly recommended
Large production training             | 16+ nodes                    | Container required
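
The matrix can also be expressed as a small lookup keyed on node count, which may be handy in a submission wrapper. This is a rough sketch: execution_style is a hypothetical helper, the single-node case collapses the two 1-node rows into one answer, and a 16-node job is assumed to fall in the "16+" tier because the matrix leaves that boundary ambiguous.

```shell
# execution_style NODES
# Maps a node count to the suggested execution style from the matrix above.
execution_style() (
  nodes=$1
  if [ "$nodes" -ge 16 ]; then
    echo "container required"
  elif [ "$nodes" -ge 2 ]; then
    echo "container strongly recommended"
  else
    echo "bare metal or container"
  fi
)

for n in 1 4 16; do
  echo "${n} node(s): $(execution_style "$n")"
done
```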

Should I run non-containerized jobs on NVL72?

It is technically possible to run non-containerized jobs on an NVL72 SuperPOD if required modules and libraries are available directly on the host OS. In practice, most modern deployments strongly encourage or require containerized workflows for reproducibility, environment control, and operational consistency.

  • Reason to use containers: guarantee that your framework, CUDA, NCCL, and dependencies match the GB200 NVL72 environment and remain reproducible across runs and users.
  • Reason to allow bare-metal jobs: quick validation runs, diagnostics, or vendor provided examples that match the host environment exactly.
  • Risk of bare-metal work: library conflicts, harder debugging, and less predictable behavior on a large shared system.

Practical guidance: for a 4-rack NVL72 SuperPOD, treat bare-metal jobs as the exception, not the norm. Prefer Pyxis+Enroot for all multi-node or long-running workloads.


Job submission guide and best practices

Job submission on NVL72 should be deliberate because placement quality, software consistency, and wall-time strategy all affect both job reliability and cluster efficiency.

Understanding segment-based placement

Use smaller --segment values for flexible, lightly coupled jobs and larger values for communication-heavy distributed training. Start with the site default when uncertain, then tighten the segment size only when scaling data shows a clear benefit.

Freeze the environment

For production jobs, reuse the same container image tag once it has been validated. Only move to a new tag when dependencies change intentionally, such as upgrading CUDA, PyTorch, NCCL, or project libraries.


# Good: explicit, versioned image tags
registry.example.org/gb200/nvl72-pytorch:1.0
registry.example.org/gb200/nvl72-pytorch:1.1

# Avoid for production if it changes underneath you
registry.example.org/gb200/nvl72-pytorch:latest
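
A submission script can guard against floating tags before it launches anything. The check below is a minimal sketch under the assumption that "pinned" simply means an explicit tag or digest other than latest; is_pinned_image is a hypothetical helper name, not part of Slurm, Pyxis, or Enroot.

```shell
# is_pinned_image IMAGE
# Succeeds when the image reference carries an explicit tag (or digest)
# other than "latest"; fails for ":latest" or for a missing tag.
is_pinned_image() (
  name=${1##*/}   # last path component, for example nvl72-pytorch:1.0
  case "$name" in
    *:latest) return 1 ;;
    *:?*)     return 0 ;;
    *)        return 1 ;;
  esac
)

for image in \
  registry.example.org/gb200/nvl72-pytorch:1.0 \
  registry.example.org/gb200/nvl72-pytorch:latest; do
  if is_pinned_image "$image"; then
    echo "ok: ${image}"
  else
    echo "refusing floating tag: ${image}"
  fi
done
```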

Example 1 - Simple non-container batch job

#!/bin/bash
#SBATCH --job-name=nvl72-smoke-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --time=00:10:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-smoke-4gpu-%j.out
#SBATCH --error=nvl72-smoke-4gpu-%j.err

module load cuda
module load pytorch
nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

Example 2 - Multi-GPU training (bare metal)

#!/bin/bash
#SBATCH --job-name=nvl72-train-bare-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
# One task per node: torchrun spawns the per-GPU worker processes itself.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=02:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-train-bare-4gpu-%j.out
#SBATCH --error=nvl72-train-bare-4gpu-%j.err

module load cuda
module load pytorch

srun torchrun \
  --nnodes=1 \
  --nproc_per_node=4 \
  train.py --config configs/nvl72_single_4gpu.yaml

Example 3 - Segment-aware multi-node test

#!/bin/bash
#SBATCH --job-name=nvl72-segment-test
#SBATCH --partition=beta
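#SBATCH --nodes=4
# One task per node: torchrun spawns the per-GPU worker processes itself.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
# Segment grouping is part of the allocation request, so it is set here
# rather than on the srun step.
#SBATCH --segment=4
#SBATCH --time=01:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-segment-test-%j.out
#SBATCH --error=nvl72-segment-test-%j.err

# SLURM_JOB_NODELIST is a compressed hostlist (for example node[01-04]),
# so expand it and use the first host as the rendezvous endpoint.
HEAD_NODE=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)

srun torchrun \
  --nnodes="${SLURM_JOB_NUM_NODES}" \
  --nproc_per_node="${SLURM_GPUS_PER_NODE}" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${HEAD_NODE}:29500" \
  train_distributed.py --config configs/nvl72_scale_test.yaml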

Pyxis + Enroot basics

Pyxis is a Slurm plugin that integrates Enroot containers into job submission, allowing you to launch GPU-accelerated containers directly from srun or sbatch without wrapper scripts.
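
The containerized examples in this guide pass a comma-separated list of host-to-container path pairs to --container-mounts. When a job needs several mounts, it can be easier to assemble that list with a small helper. The sketch below is illustrative only: build_mounts is a hypothetical function, and its requirement that both sides of each pair be absolute paths is a convention chosen for this example rather than a Pyxis rule.

```shell
# build_mounts SRC:DST [SRC:DST ...]
# Joins mount pairs with commas for --container-mounts and rejects any
# pair that is not of the form /host/path:/container/path.
build_mounts() (
  mounts=""
  for spec in "$@"; do
    case "$spec" in
      /*:/*) mounts="${mounts:+${mounts},}${spec}" ;;
      *)     echo "invalid mount spec: ${spec}" >&2; return 1 ;;
    esac
  done
  echo "$mounts"
)

# Dry run: print the srun command instead of executing it.
# The /project/demo and /datasets/demo paths are placeholders.
MOUNTS=$(build_mounts /project/demo:/workspace /datasets/demo:/data)
echo "srun --container-image=registry.example.org/gb200/nvl72-pytorch:1.0 --container-mounts=${MOUNTS} nvidia-smi"
```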


Example 4 - Single-node Pyxis+Enroot job

#!/bin/bash
#SBATCH --job-name=nvl72-container-single-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
# One task per node: torchrun spawns the per-GPU worker processes itself.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-container-single-4gpu-%j.out
#SBATCH --error=nvl72-container-single-4gpu-%j.err

CONTAINER_IMAGE=registry.example.org/gb200/nvl72-pytorch:1.0
PROJECT_ROOT=/project/<proj>
DATA_ROOT=/datasets/<name>

srun \
  --container-image="${CONTAINER_IMAGE}" \
  --container-mounts="${PROJECT_ROOT}:/workspace,${DATA_ROOT}:/data" \
  --container-remap-root \
  bash -lc "cd /workspace; torchrun --nnodes=1 --nproc_per_node=4 train.py --config configs/nvl72_container_4gpu.yaml --data-dir /data"

Example 5 - Import or prepare an Enroot image

# Enroot URIs separate the registry host from the image path with '#'.
enroot import --output nvl72-pytorch-1.0.sqsh 'docker://registry.example.org#gb200/nvl72-pytorch:1.0'
enroot create --name nvl72-pytorch-1.0 nvl72-pytorch-1.0.sqsh
enroot list

Example 6 - Multi-node distributed job in a container

#!/bin/bash
#SBATCH --job-name=nvl72-container-ddp
#SBATCH --partition=beta
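#SBATCH --nodes=8
# One task per node: torchrun spawns the per-GPU worker processes itself.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
# Segment grouping is part of the allocation request, so it is set here
# rather than on the srun step.
#SBATCH --segment=8
#SBATCH --time=12:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-container-ddp-%j.out
#SBATCH --error=nvl72-container-ddp-%j.err

CONTAINER_IMAGE=registry.example.org/gb200/nvl72-pytorch:1.0
PROJECT_ROOT=/project/<proj>
DATA_ROOT=/datasets/<name>

# Resolve the rendezvous host in the batch script, where scontrol is available;
# SLURM_JOB_NODELIST is a compressed hostlist, not a single hostname.
HEAD_NODE=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)

# ${HEAD_NODE} expands here in the batch script; \${...} expands inside the container.
srun \
  --container-image="${CONTAINER_IMAGE}" \
  --container-mounts="${PROJECT_ROOT}:/workspace,${DATA_ROOT}:/data" \
  --container-remap-root \
  bash -lc "
    cd /workspace;
    torchrun \
      --nnodes=\${SLURM_JOB_NUM_NODES} \
      --nproc_per_node=\${SLURM_GPUS_PER_NODE} \
      --rdzv_backend=c10d \
      --rdzv_endpoint=${HEAD_NODE}:29500 \
      train_distributed.py --config configs/nvl72_large.yaml --data-dir /data --log-dir /workspace/logs/nvl72
  "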

Container workflow: build - test - scale

  1. Build or choose a GB200 compatible image.
  2. Test on a single node with 4 GPUs.
  3. Scale to a small multi-node run.
  4. Scale to production.
  5. Freeze the environment. Reuse the same validated image tag for production.
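
The test and scale steps above can be sketched as a dry run that prints one submission command per stage. The stage sizes (1, 2, and 8 nodes) and the print_scale_plan helper are assumptions made for illustration; substitute the node counts and batch script that match your workload.

```shell
# print_scale_plan SCRIPT
# Prints (does not submit) one sbatch command per scaling stage, using the
# 4 GPUs per node that the examples in this guide request.
print_scale_plan() (
  script=$1
  gpus_per_node=4
  for nodes in 1 2 8; do
    printf '%d GPUs: sbatch --partition=beta --nodes=%d --gpus-per-node=%d %s\n' \
      "$(( nodes * gpus_per_node ))" "$nodes" "$gpus_per_node" "$script"
  done
)

print_scale_plan train.sh
```

Submitting each stage by hand, and only after the previous one has produced sensible results, keeps a broken configuration from consuming a large allocation.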

Best practices on NVL72

  • Prefer containers: use Pyxis+Enroot as the default for NVL72.
  • Use non-container jobs sparingly: limit bare-metal jobs to quick checks.
  • Test small, then scale: validate workflows on 1 node / 4 GPUs before scaling.
  • Use segment intentionally: increase segment size only when the workload benefits from stronger locality.
  • Checkpoint regularly: long-running distributed jobs should save state frequently.
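
Checkpointing only pays off if a restarted job can find the most recent state. The helper below is one way to do that from a batch script. It is a sketch under stated assumptions: the step_<zero-padded number>.pt naming scheme and the latest_checkpoint function are hypothetical, and your training code may store checkpoints differently.

```shell
# latest_checkpoint DIR
# Prints the newest checkpoint in DIR, or fails if there is none.
# Glob results are sorted lexically, so the last match is the highest step
# as long as step numbers are zero-padded to the same width.
latest_checkpoint() (
  dir=$1
  latest=""
  for f in "$dir"/step_*.pt; do
    [ -e "$f" ] && latest=$f
  done
  [ -n "$latest" ] && echo "$latest"
)

# Example with a throwaway directory standing in for a real log directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/step_000020.pt" "$demo_dir/step_000100.pt"
if ckpt=$(latest_checkpoint "$demo_dir"); then
  echo "would resume from: ${ckpt}"
else
  echo "no checkpoint found, starting from scratch"
fi
rm -rf "$demo_dir"
```

In a real job the printed path would be passed to whatever resume option your training script provides.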
