GB200 NVL72 SuperPOD
Getting started on the NVIDIA GB200 NVL72 4-rack cluster
This guide combines NVL72 platform context, access instructions, topology-aware scheduling guidance, workload selection advice, and practical Slurm job examples for both non-containerized and Pyxis+Enroot containerized workflows on the beta partition.
Table of contents
- Environment overview
- 4-rack cluster totals
- Segment-based topology-aware scheduling
- Workload use-case matrix
- Should I run non-containerized jobs?
- Job submission guide and best practices
- Example 1 - Simple non-container batch job
- Example 2 - Multi-GPU training (bare metal)
- Example 3 - Segment-aware multi-node test
- Pyxis + Enroot basics
- Example 4 - Single-node Pyxis+Enroot job
- Example 5 - Import or prepare an Enroot image
- Example 6 - Multi-node distributed job in a container
- Container workflow: build - test - scale
- Best practices on NVL72
Environment overview
The GB200 NVL72 SuperPOD is a multi-rack cluster built from Grace-based nodes with tightly integrated B200 GPUs and a high-bandwidth NVLink/NVSwitch fabric, exposed to users through the beta partition for large-scale AI and HPC workloads.
- Designed for very large, multi-node jobs that need strong scaling across many GPUs.
- Often paired with a dedicated software stack and container images tuned for GB200/B200 systems.
Recommended usage: Use smaller environments for early development, then promote stable workflows to beta once they are validated and ready to scale.
4-rack cluster totals
The table below gives a quick system-level view of the NVL72 deployment and helps explain why placement, scale, and scheduling policy matter on this platform.
| Resource | Per Rack | 4-Rack Cluster Total |
|---|---|---|
| B200 GPUs | 72 | 288 GPUs |
| Grace CPUs | 36 | 144 CPUs (10,368 ARM cores) |
| Compute nodes (trays) | 18 | 72 nodes |
| GPU memory (HBM3e) | 13.4 TB | ~53.6 TB |
| NVLink domains | 1 | 4 independent rack domains |
| Scale-out NICs | 72 x ConnectX-8 (800 Gb/s each; 57.6 Tb/s) | 288 x ConnectX-8; 230.4 Tb/s aggregate |
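The totals follow directly from the per-rack figures. As a quick sanity check, the arithmetic can be reproduced in shell. This is a minimal sketch: the variable names are illustrative, and the 72-cores-per-CPU value is the one implied by the table (10,368 cores / 144 CPUs).

```shell
#!/bin/bash
# Derive the 4-rack totals from the per-rack figures in the table above.
racks=4
gpus_per_rack=72
cpus_per_rack=36
nodes_per_rack=18
cores_per_cpu=72      # implied by the table: 10,368 cores / 144 CPUs
nics_per_rack=72
nic_gbps=800

total_gpus=$(( racks * gpus_per_rack ))
total_cpus=$(( racks * cpus_per_rack ))
total_nodes=$(( racks * nodes_per_rack ))
total_cores=$(( total_cpus * cores_per_cpu ))
gpus_per_node=$(( gpus_per_rack / nodes_per_rack ))
rack_nic_gbps=$(( nics_per_rack * nic_gbps ))

echo "GPUs: ${total_gpus}, CPUs: ${total_cpus} (${total_cores} cores), nodes: ${total_nodes}"
echo "GPUs per node: ${gpus_per_node}"
echo "Scale-out bandwidth per rack: ${rack_nic_gbps} Gb/s"
```

The GPUs-per-node result (72 GPUs across 18 nodes) is why the job examples in this guide request --gpus-per-node=4.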
Segment-based topology-aware scheduling
On NVL72 systems configured with Slurm block scheduling, --segment is the user-facing way to express how many nodes should stay grouped together within the topology plan. This gives users a way to trade off locality against scheduling flexibility.
- Smaller segments: more flexible scheduling, often faster starts, but less strict locality.
- Larger segments: stronger locality for communication-heavy jobs, but potentially longer queue times.
- Important: --segment does not create topology by itself; it relies on the administrators having configured the cluster with Slurm block topology.
| Segment size | Typical meaning | Recommended use case |
|---|---|---|
| --segment=1 | Maximum placement flexibility | Independent work, smoke tests, loosely coupled jobs |
| --segment=4 | Keep nodes grouped in sets of 4 | Small distributed scale-out tests |
| --segment=8 | Larger grouped placement | Medium distributed training jobs |
| --segment=16 | Strong locality across large groups | Communication-heavy larger jobs |
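As a rough pre-submission check, the sketch below reports how many node groups a job of a given size would be divided into for a chosen segment size, and flags a segment that could not fit inside a single rack. It assumes that one Slurm block corresponds to one 18-node rack, which is a site configuration detail to confirm with the administrators. The function name segment_plan is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: summarize how --nodes and --segment relate.
# Assumption: one Slurm block == one NVL72 rack of 18 nodes (confirm with your site).
NODES_PER_BLOCK=18

segment_plan() {
    local nodes=$1 segment=$2
    if (( nodes < 1 || segment < 1 )); then
        echo "error: nodes and segment must be positive" >&2
        return 2
    fi
    if (( segment > NODES_PER_BLOCK )); then
        echo "error: segment ${segment} exceeds assumed block size ${NODES_PER_BLOCK}" >&2
        return 1
    fi
    # Ceiling division: number of node groups the job would be split into.
    local groups=$(( (nodes + segment - 1) / segment ))
    echo "nodes=${nodes} segment=${segment} groups=${groups}"
}

segment_plan 4 4
segment_plan 8 4
```

A job with a single group asks for the tightest placement; more groups give the scheduler more freedom in where each group lands.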
# Example: 4-node job with 4-node segment grouping
sbatch --partition=beta --nodes=4 --gpus-per-node=4 --segment=4 my_job.sh
# Example: 8-node job with 8-node segment grouping
sbatch --partition=beta --nodes=8 --gpus-per-node=4 --segment=8 train.sh
Workload use-case matrix
Use this matrix to map the shape of the workload to a recommended launch pattern on beta.
| Workload pattern | Typical size | Suggested execution style |
|---|---|---|
| Environment smoke test | 1 node, 1-4 GPUs | Bare metal or container |
| Single-node fine-tuning | 1 node, 4 GPUs | Prefer container |
| Data parallel scale-out test | 2-4 nodes, 4 GPUs per node | Container strongly recommended |
| Tensor or pipeline parallel training | 4-16 nodes | Container strongly recommended |
| Large production training | 16+ nodes | Container required |
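The matrix can be condensed into a small lookup keyed on node count. The sketch below is illustrative only: suggest_style is a hypothetical helper, and because the matrix lists 16 nodes in both the "4-16" and "16+" rows, the code assumes the stricter rule applies at exactly 16 nodes.

```shell
#!/bin/bash
# Hypothetical helper mirroring the matrix above.
# Assumption: the overlapping 16-node boundary is treated as "container required".
suggest_style() {
    local nodes=$1
    if (( nodes >= 16 )); then
        echo "container required"
    elif (( nodes >= 2 )); then
        echo "container strongly recommended"
    else
        echo "container preferred; bare metal acceptable for smoke tests"
    fi
}

suggest_style 1
suggest_style 4
suggest_style 32
```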
Should I run non-containerized jobs on NVL72?
It is technically possible to run non-containerized jobs on an NVL72 SuperPOD if required modules and libraries are available directly on the host OS. In practice, most modern deployments strongly encourage or require containerized workflows for reproducibility, environment control, and operational consistency.
- Reason to use containers: guarantee that your framework, CUDA, NCCL, and dependencies match the GB200 NVL72 environment and remain reproducible across runs and users.
- Reason to allow bare-metal jobs: quick validation runs, diagnostics, or vendor-provided examples that match the host environment exactly.
- Risk of bare-metal work: library conflicts, harder debugging, and less predictable behavior on a large shared system.
Practical guidance: for a 4-rack NVL72 SuperPOD, treat bare-metal jobs as the exception, not the norm. Prefer Pyxis+Enroot for all multi-node or long-running workloads.
Job submission guide and best practices
Job submission on NVL72 should be deliberate because placement quality, software consistency, and wall-time strategy all affect both job reliability and cluster efficiency.
Understanding segment-based placement
Use smaller --segment values for flexible, lightly coupled jobs and larger values for communication-heavy distributed training. Start with the site default when uncertain, then tighten the segment size only when scaling data shows a clear benefit.
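One simple way to judge whether a different segment size shows "a clear benefit" is to compare measured throughput from two otherwise identical runs. The sketch below computes the percentage change. The function name throughput_gain_pct is hypothetical, and the numbers in the usage line are placeholders, not measurements from this cluster.

```shell
#!/bin/bash
# Hypothetical helper: percentage change in throughput between two runs.
# Inputs are measured throughput values (e.g. samples/sec) from your own jobs.
throughput_gain_pct() {
    local baseline=$1 candidate=$2
    awk -v b="$baseline" -v c="$candidate" \
        'BEGIN { printf "%.1f\n", (c - b) / b * 100 }'
}

# Placeholder numbers for illustration only.
throughput_gain_pct 1000 1080
```

What counts as a clear benefit is a judgment call; a gain that falls within run-to-run noise is usually not worth the longer queue time that a larger segment can cause.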
Freeze the environment
For production jobs, reuse the same container image tag once it has been validated. Only move to a new tag when dependencies change intentionally, such as upgrading CUDA, PyTorch, NCCL, or project libraries.
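A job script can enforce this policy before launching anything. The following is a minimal sketch of such a guard; require_pinned_tag is a hypothetical helper, and it only checks that a tag is present and is not "latest".

```shell
#!/bin/bash
# Hypothetical guard: reject image references that are untagged or use :latest.
require_pinned_tag() {
    local image=$1
    local last=${image##*/}    # final path component, e.g. nvl72-pytorch:1.0
    local tag=""
    if [[ "$last" == *:* ]]; then
        tag=${last##*:}
    fi
    if [[ -z "$tag" || "$tag" == "latest" ]]; then
        echo "refusing unpinned image: ${image}" >&2
        return 1
    fi
    echo "ok: ${image} (tag ${tag})"
}

require_pinned_tag "registry.example.org/gb200/nvl72-pytorch:1.0"
```

Looking only at the final path component keeps a registry port such as host:5000 from being mistaken for a tag.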
# Good: explicit, versioned image tags
registry.example.org/gb200/nvl72-pytorch:1.0
registry.example.org/gb200/nvl72-pytorch:1.1
# Avoid for production if it changes underneath you
registry.example.org/gb200/nvl72-pytorch:latest
Example 1 - Simple non-container batch job
#!/bin/bash
#SBATCH --job-name=nvl72-smoke-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --time=00:10:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-smoke-4gpu-%j.out
#SBATCH --error=nvl72-smoke-4gpu-%j.err
module load cuda
module load pytorch
nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
Example 2 - Multi-GPU training (bare metal)
#!/bin/bash
#SBATCH --job-name=nvl72-train-bare-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
# One task per node: torchrun itself spawns one worker per GPU.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=02:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-train-bare-4gpu-%j.out
#SBATCH --error=nvl72-train-bare-4gpu-%j.err
module load cuda
module load pytorch
srun torchrun --nnodes=1 --nproc_per_node=4 train.py --config configs/nvl72_single_4gpu.yaml
Example 3 - Segment-aware multi-node test
#!/bin/bash
#SBATCH --job-name=nvl72-segment-test
#SBATCH --partition=beta
#SBATCH --nodes=4
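# One task per node: torchrun itself spawns one worker per GPU.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
# Request segment grouping at allocation time, not on the srun step.
#SBATCH --segment=4
#SBATCH --time=01:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-segment-test-%j.out
#SBATCH --error=nvl72-segment-test-%j.err
# SLURM_NODELIST is a compressed host list (e.g. node[01-04]); use the first host as the rendezvous endpoint.
MASTER_ADDR=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)
srun torchrun --nnodes=${SLURM_JOB_NUM_NODES} --nproc_per_node=${SLURM_GPUS_PER_NODE} --rdzv_id=${SLURM_JOB_ID} --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29500 train_distributed.py --config configs/nvl72_scale_test.yaml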
Pyxis + Enroot basics
Pyxis is a Slurm plugin that integrates Enroot containers into job submission, allowing you to launch GPU-accelerated containers directly from srun or sbatch without wrapper scripts.
Example 4 - Single-node Pyxis+Enroot job
#!/bin/bash
#SBATCH --job-name=nvl72-container-single-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
# One task per node: torchrun itself spawns one worker per GPU.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-container-single-4gpu-%j.out
#SBATCH --error=nvl72-container-single-4gpu-%j.err
CONTAINER_IMAGE=registry.example.org/gb200/nvl72-pytorch:1.0
PROJECT_ROOT=/project/<proj>
DATA_ROOT=/datasets/<name>
srun --container-image="${CONTAINER_IMAGE}" \
     --container-mounts="${PROJECT_ROOT}:/workspace,${DATA_ROOT}:/data" \
     --container-remap-root \
     bash -lc "cd /workspace; torchrun --nnodes=1 --nproc_per_node=4 train.py --config configs/nvl72_container_4gpu.yaml --data-dir /data"
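The --container-mounts value is a comma-separated list of SRC:DST pairs. When a job needs more than a couple of mounts, building the list from separate entries can be easier to read and review. The sketch below shows one way to do that; join_mounts is a hypothetical helper and the two paths are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: join SRC:DST pairs into the comma-separated list
# that --container-mounts expects.
join_mounts() {
    local IFS=,
    echo "$*"
}

PROJECT_ROOT=/project/demo     # placeholder path
DATA_ROOT=/datasets/demo       # placeholder path
MOUNTS=$(join_mounts "${PROJECT_ROOT}:/workspace" "${DATA_ROOT}:/data")
echo "--container-mounts=${MOUNTS}"
```

The resulting string can then be passed to srun as --container-mounts="${MOUNTS}".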
Example 5 - Import or prepare an Enroot image
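# Enroot's documented URI form is docker://[USER@][REGISTRY#]IMAGE[:TAG], so the registry is separated with '#'.
# Naming the output file explicitly avoids depending on Enroot's default file name.
enroot import --output nvl72-pytorch-1.0.sqsh 'docker://registry.example.org#gb200/nvl72-pytorch:1.0'
enroot create --name nvl72-pytorch-1.0 nvl72-pytorch-1.0.sqsh
enroot list
Example 6 - Multi-node distributed job in a container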
#!/bin/bash
#SBATCH --job-name=nvl72-container-ddp
#SBATCH --partition=beta
#SBATCH --nodes=8
# One task per node: torchrun itself spawns one worker per GPU.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
# Request segment grouping at allocation time, not on the srun step.
#SBATCH --segment=8
#SBATCH --time=12:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-container-ddp-%j.out
#SBATCH --error=nvl72-container-ddp-%j.err
CONTAINER_IMAGE=registry.example.org/gb200/nvl72-pytorch:1.0
PROJECT_ROOT=/project/<proj>
DATA_ROOT=/datasets/<name>
# SLURM_NODELIST is a compressed host list (e.g. node[01-08]); use the first host as the rendezvous endpoint.
MASTER_ADDR=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)
srun --container-image="${CONTAINER_IMAGE}" --container-mounts="${PROJECT_ROOT}:/workspace,${DATA_ROOT}:/data" --container-remap-root bash -lc "
cd /workspace;
NUM_NODES=\${SLURM_JOB_NUM_NODES};
GPUS_PER_NODE=\${SLURM_GPUS_PER_NODE};
torchrun \
--nnodes=\${NUM_NODES} \
--nproc_per_node=\${GPUS_PER_NODE} \
--rdzv_id=\${SLURM_JOB_ID} \
--rdzv_backend=c10d \
--rdzv_endpoint=${MASTER_ADDR}:29500 \
train_distributed.py --config configs/nvl72_large.yaml --data-dir /data --log-dir /workspace/logs/nvl72
"
Container workflow: build - test - scale
- Build or choose a GB200 compatible image.
- Test on a single node with 4 GPUs.
- Scale to a small multi-node run.
- Scale to production.
- Freeze the environment. Reuse the same validated image tag for production.
Best practices on NVL72
- Prefer containers: use Pyxis+Enroot as the default for NVL72.
- Use non-container jobs sparingly: limit bare-metal jobs to quick checks.
- Test small, then scale: validate workflows on 1 node / 4 GPUs before scaling.
- Use segment intentionally: increase segment size only when the workload benefits from stronger locality.
- Checkpoint regularly: long-running distributed jobs should save state frequently.
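One common way to make checkpointing robust against wall-time limits is to have Slurm signal the batch script shortly before the limit, for example with #SBATCH --signal=B:USR1@300, and to trap that signal in the script. The sketch below shows only the trap mechanics and simulates the signal locally. The save_checkpoint function and the marker file are hypothetical stand-ins for whatever your training code does to write its state.

```shell
#!/bin/bash
# Sketch of signal-driven checkpointing. In a real job you would add, e.g.:
#   #SBATCH --signal=B:USR1@300   # SIGUSR1 to the batch shell 300 s before the time limit
CKPT_MARKER=$(mktemp)             # hypothetical stand-in for a checkpoint file
CHECKPOINT_REQUESTED=0

save_checkpoint() {
    # Hypothetical: replace with a call that makes your trainer write its state.
    CHECKPOINT_REQUESTED=1
    echo "checkpoint requested at $(date +%s)" > "${CKPT_MARKER}"
}

trap save_checkpoint USR1

# Simulate Slurm delivering the signal to this shell.
kill -USR1 "${BASHPID}"

echo "checkpoint requested: ${CHECKPOINT_REQUESTED}"
```

In a real batch script, bash runs the trap only after the current foreground command returns, so the srun step is usually started in the background and followed by wait to let the handler run promptly.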