Getting started on the NVIDIA GB200 NVL72

Created by Cesar Arias, Modified on Fri, 22 May at 2:24 AM by Cesar Arias

GB200 NVL72 SuperPOD

Getting started on the NVIDIA GB200 NVL72 4-rack cluster

This guide combines NVL72 platform context, access instructions, topology-aware scheduling guidance, workload selection advice, and practical Slurm job examples for both non-containerized and Pyxis+Enroot containerized workflows on the beta partition.

Table of contents

Environment overview
4-rack cluster totals
Segment-based topology-aware scheduling
Workload use-case matrix
Should I run non-containerized jobs on NVL72?
Job submission guide and best practices
Example 1 - Simple non-container batch job
Example 2 - Multi-GPU training (bare metal)
Example 3 - Segment-aware multi-node test
Pyxis + Enroot basics
Example 4 - Single-node Pyxis+Enroot job
Example 5 - Import or prepare an Enroot image
Example 6 - Multi-node distributed job in a container
Container workflow: build - test - scale
Best practices on NVL72

Environment overview

The GB200 NVL72 SuperPOD is a multi-rack cluster built from Grace-based nodes with tightly integrated B200 GPUs and a high-bandwidth NVLink/NVSwitch fabric, exposed to users through the beta partition for large-scale AI and HPC workloads.

Designed for very large, multi-node jobs that need strong scaling across many GPUs.
Often paired with a dedicated software stack and container images tuned for GB200/B200 systems.

Recommended usage: Use smaller environments for early development, then promote stable workflows to beta once they are validated and ready to scale.

4-rack cluster totals

The table below gives a quick system-level view of the NVL72 deployment and helps explain why placement, scale, and scheduling policy matter on this platform.

Resource	Per Rack	4-Rack Cluster Total
B200 GPUs	72	288 GPUs
Grace CPUs	36	144 CPUs (10,368 ARM cores)
Compute nodes (trays)	18	72 nodes
GPU memory (HBM3e)	13.4 TB	~53.6 TB
NVLink domains	1	4 independent rack domains
Scale-out NICs	72 x ConnectX-8 (800 Gb/s each; 57.6 Tb/s)	288 x ConnectX-8 (230.4 Tb/s)
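
The cluster totals follow directly from the per-rack figures. As a quick check, the arithmetic can be reproduced in shell. The 72 cores per Grace CPU is inferred from the table's own numbers (10,368 cores / 144 CPUs), and the variable names are illustrative only.

```shell
#!/bin/bash
# Per-rack figures taken from the table above.
RACKS=4
GPUS_PER_RACK=72
CPUS_PER_RACK=36
NODES_PER_RACK=18
NICS_PER_RACK=72
NIC_GBPS=800
CORES_PER_CPU=72   # inferred: 10,368 cores / 144 CPUs

TOTAL_GPUS=$(( RACKS * GPUS_PER_RACK ))
TOTAL_CPUS=$(( RACKS * CPUS_PER_RACK ))
TOTAL_CORES=$(( TOTAL_CPUS * CORES_PER_CPU ))
TOTAL_NODES=$(( RACKS * NODES_PER_RACK ))
GPUS_PER_NODE=$(( GPUS_PER_RACK / NODES_PER_RACK ))
RACK_NIC_GBPS=$(( NICS_PER_RACK * NIC_GBPS ))   # scale-out bandwidth per rack, in Gb/s

echo "GPUs=${TOTAL_GPUS} CPUs=${TOTAL_CPUS} cores=${TOTAL_CORES} nodes=${TOTAL_NODES}"
echo "GPUs per node=${GPUS_PER_NODE}, scale-out per rack=${RACK_NIC_GBPS} Gb/s"
```

The derived 4 GPUs per node is why the job examples in this guide request --gpus-per-node=4.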

Segment-based topology-aware scheduling

On NVL72 systems configured with Slurm block scheduling, --segment is the user-facing way to express how many nodes should stay grouped together within the topology plan. This gives users a way to trade off locality against scheduling flexibility.

Smaller segments: more flexible scheduling, often faster starts, but less strict locality.
Larger segments: stronger locality for communication-heavy jobs, but potentially longer queue times.
Important: --segment does not create topology by itself; it relies on the cluster being configured with Slurm block topology by the administrators.

Segment size	Typical meaning	Recommended use case
`--segment=1`	Maximum placement flexibility	Independent work, smoke tests, loosely coupled jobs
`--segment=4`	Keep nodes grouped in sets of 4	Small distributed scale-out tests
`--segment=8`	Larger grouped placement	Medium distributed training jobs
`--segment=16`	Strong locality across large groups	Communication-heavy larger jobs

# Example: 4-node job with 4-node segment grouping
sbatch --partition=beta --nodes=4 --gpus-per-node=4 --segment=4 my_job.sh

# Example: 8-node job with 8-node segment grouping
sbatch --partition=beta --nodes=8 --gpus-per-node=4 --segment=8 train.sh
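
Before submitting, it can help to sanity-check the requested segment size against the job size. The sketch below is not part of Slurm: check_segment is a hypothetical helper, and it assumes one Slurm block corresponds to one 18-node rack, which should be confirmed with the cluster administrators.

```shell
#!/bin/bash
# Assumption: one Slurm block equals one rack of 18 nodes (per the totals table).
# The real block size is set by the administrators and may differ.
BLOCK_NODES="${BLOCK_NODES:-18}"

# check_segment NODES SEGMENT  (hypothetical helper, not a Slurm command)
check_segment() {
    local nodes="$1"
    local segment="$2"
    if [ "$nodes" -lt 1 ] || [ "$segment" -lt 1 ]; then
        echo "error: nodes and segment must be positive" >&2
        return 1
    fi
    if [ "$segment" -gt "$nodes" ]; then
        echo "error: segment ${segment} exceeds job size ${nodes}" >&2
        return 1
    fi
    if [ "$segment" -gt "$BLOCK_NODES" ]; then
        echo "error: segment ${segment} cannot fit in one ${BLOCK_NODES}-node block" >&2
        return 1
    fi
    echo "ok: ${nodes} nodes, grouped ${segment} at a time"
}

check_segment 8 8
```

A wrapper script could call check_segment first and only run sbatch when it succeeds.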

Workload use-case matrix

Use this matrix to map the shape of the workload to a recommended launch pattern on beta.

Workload pattern	Typical size	Suggested execution style
Environment smoke test	1 node, 1-4 GPUs	Bare metal or container
Single-node fine-tuning	1 node, 4 GPUs	Prefer container
Data parallel scale-out test	2-4 nodes, 4 GPUs per node	Container strongly recommended
Tensor or pipeline parallel training	4-16 nodes	Container strongly recommended
Large production training	16+ nodes	Container required
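
The matrix can be encoded as a small lookup for use in submission wrappers. This is only a sketch: suggest_style is a hypothetical helper, and because the matrix lists both "4-16 nodes" and "16+ nodes", it assumes a 16-node job falls under the stricter "required" rule.

```shell
#!/bin/bash
# suggest_style NODES  (hypothetical helper mirroring the matrix above)
# Assumption: the overlapping 16-node boundary is treated as "container required".
suggest_style() {
    if [ "$1" -le 1 ]; then
        echo "bare metal or container"
    elif [ "$1" -lt 16 ]; then
        echo "container strongly recommended"
    else
        echo "container required"
    fi
}

for n in 1 4 16; do
    echo "${n} node(s): $(suggest_style "$n")"
done
```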

Should I run non-containerized jobs on NVL72?

It is technically possible to run non-containerized jobs on an NVL72 SuperPOD if required modules and libraries are available directly on the host OS. In practice, most modern deployments strongly encourage or require containerized workflows for reproducibility, environment control, and operational consistency.

Reason to use containers: guarantee that your framework, CUDA, NCCL, and dependencies match the GB200 NVL72 environment and remain reproducible across runs and users.
Reason to allow bare-metal jobs: quick validation runs, diagnostics, or vendor-provided examples that match the host environment exactly.
Risk of bare-metal work: library conflicts, harder debugging, and less predictable behavior on a large shared system.

Practical guidance: for a 4-rack NVL72 SuperPOD, treat bare-metal jobs as the exception, not the norm. Prefer Pyxis+Enroot for all multi-node or long-running workloads.

Job submission guide and best practices

Job submission on NVL72 should be deliberate because placement quality, software consistency, and wall-time strategy all affect both job reliability and cluster efficiency.

Understanding segment-based placement

Use smaller --segment values for flexible, lightly coupled jobs and larger values for communication-heavy distributed training. Start with the site default when uncertain, then tighten the segment size only when scaling data shows a clear benefit.

Freeze the environment

For production jobs, reuse the same container image tag once it has been validated. Only move to a new tag when dependencies change intentionally, such as upgrading CUDA, PyTorch, NCCL, or project libraries.

# Good: explicit, versioned image tags
registry.example.org/gb200/nvl72-pytorch:1.0
registry.example.org/gb200/nvl72-pytorch:1.1

# Avoid for production if it changes underneath you
registry.example.org/gb200/nvl72-pytorch:latest
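
One way to enforce this in a production submission script is to refuse floating tags before the job is queued. The helper below is a minimal sketch; require_pinned_tag is a hypothetical name, and it only inspects the image string, not the registry.

```shell
#!/bin/bash
# require_pinned_tag IMAGE  (hypothetical helper)
# Succeeds and prints the tag only when IMAGE has an explicit tag other than "latest".
require_pinned_tag() {
    local image="$1"
    local last="${image##*/}"     # final path component, e.g. nvl72-pytorch:1.0
    case "$last" in
        *:*) ;;
        *)
            echo "error: ${image} has no tag" >&2
            return 1
            ;;
    esac
    local tag="${last##*:}"
    if [ -z "$tag" ] || [ "$tag" = "latest" ]; then
        echo "error: ${image} uses a floating tag" >&2
        return 1
    fi
    echo "$tag"
}

require_pinned_tag registry.example.org/gb200/nvl72-pytorch:1.0
```

Checking only the final path component keeps a registry port such as host:5000 from being mistaken for a tag.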

Example 1 - Simple non-container batch job

#!/bin/bash
#SBATCH --job-name=nvl72-smoke-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --time=00:10:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-smoke-4gpu-%j.out
#SBATCH --error=nvl72-smoke-4gpu-%j.err

module load cuda
module load pytorch
nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

Example 2 - Multi-GPU training (bare metal)

#!/bin/bash
#SBATCH --job-name=nvl72-train-bare-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
# One torchrun launcher per node; torchrun starts the 4 GPU workers itself
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=02:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-train-bare-4gpu-%j.out
#SBATCH --error=nvl72-train-bare-4gpu-%j.err

module load cuda
module load pytorch

srun torchrun \
  --nnodes=1 \
  --nproc_per_node=4 \
  train.py --config configs/nvl72_single_4gpu.yaml

Example 3 - Segment-aware multi-node test

#!/bin/bash
#SBATCH --job-name=nvl72-segment-test
#SBATCH --partition=beta
#SBATCH --nodes=4
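#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --segment=4
#SBATCH --time=01:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-segment-test-%j.out
#SBATCH --error=nvl72-segment-test-%j.err

# SLURM_JOB_NODELIST is a compressed host list, so expand it and use the first host for rendezvous.
MASTER_ADDR=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)

# One torchrun launcher per node; torchrun starts the per-GPU workers.
srun torchrun \
  --nnodes="${SLURM_JOB_NUM_NODES}" \
  --nproc_per_node="${SLURM_GPUS_PER_NODE}" \
  --rdzv_id="${SLURM_JOB_ID}" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:29500" \
  train_distributed.py --config configs/nvl72_scale_test.yaml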

Pyxis + Enroot basics

Pyxis is a Slurm plugin that integrates Enroot containers into job submission, allowing you to launch GPU-accelerated containers directly from srun or sbatch without wrapper scripts.
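
As a minimal sketch of what this looks like in practice, the snippet below assembles a one-off srun command that runs nvidia-smi inside a container. The image reference is the placeholder used throughout this guide, run_or_print is a hypothetical helper, and the command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Placeholder image reference reused from this guide's examples.
CONTAINER_IMAGE="registry.example.org/gb200/nvl72-pytorch:1.0"

# run_or_print CMD [ARGS...]  (hypothetical helper)
# Prints the command by default; runs it only when DRY_RUN=0.
run_or_print() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pyxis contributes the --container-* options; everything else is plain Slurm.
run_or_print srun --partition=beta --nodes=1 --gpus-per-node=4 --time=00:10:00 \
    --container-image="${CONTAINER_IMAGE}" \
    --container-mounts="${PWD}:/workspace" \
    --container-workdir=/workspace \
    nvidia-smi
```

Printing first makes it easy to review the mounts and image tag before spending allocation time.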

Example 4 - Single-node Pyxis+Enroot job

#!/bin/bash
#SBATCH --job-name=nvl72-container-single-4gpu
#SBATCH --partition=beta
#SBATCH --nodes=1
# One torchrun launcher per node; torchrun starts the 4 GPU workers itself
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-container-single-4gpu-%j.out
#SBATCH --error=nvl72-container-single-4gpu-%j.err

CONTAINER_IMAGE=registry.example.org/gb200/nvl72-pytorch:1.0
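# Quoted so the shell does not parse < and > as redirections; replace the placeholders before use
PROJECT_ROOT="/project/<proj>"
DATA_ROOT="/datasets/<name>"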

srun \
  --container-image="${CONTAINER_IMAGE}" \
  --container-mounts="${PROJECT_ROOT}:/workspace,${DATA_ROOT}:/data" \
  --container-remap-root \
  bash -lc "cd /workspace; torchrun --nnodes=1 --nproc_per_node=4 train.py --config configs/nvl72_container_4gpu.yaml --data-dir /data"

Example 5 - Import or prepare an Enroot image

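# Import to an explicitly named SquashFS file so the next command does not depend on Enroot's default file name.
# Enroot documents the URI form docker://[USER@][REGISTRY#]IMAGE[:TAG]; use the # separator if the slash form is rejected.
enroot import -o nvl72-pytorch-1.0.sqsh docker://registry.example.org/gb200/nvl72-pytorch:1.0
enroot create --name nvl72-pytorch-1.0 nvl72-pytorch-1.0.sqsh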
enroot list
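
Pyxis also accepts a path to a SquashFS file in --container-image, so an imported image can be reused without pulling from the registry on every job. The sketch below prefers a local file when it exists and falls back to the registry reference otherwise. pick_image and the SQSH_IMAGE path are hypothetical; point the path at wherever your imported .sqsh file actually lives.

```shell
#!/bin/bash
# pick_image SQSH_PATH FALLBACK_REF  (hypothetical helper)
# Prints the local SquashFS path when the file exists, otherwise the registry reference.
pick_image() {
    if [ -f "$1" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Assumed location of the file produced by "enroot import"; adjust to your project space.
SQSH_IMAGE="${SQSH_IMAGE:-${PWD}/image.sqsh}"
CONTAINER_IMAGE=$(pick_image "${SQSH_IMAGE}" "registry.example.org/gb200/nvl72-pytorch:1.0")

echo "srun --container-image=${CONTAINER_IMAGE} ..."
```

Using an absolute path avoids ambiguity between a file name and an image name.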

Example 6 - Multi-node distributed job in a container

#!/bin/bash
#SBATCH --job-name=nvl72-container-ddp
#SBATCH --partition=beta
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --segment=8
#SBATCH --time=12:00:00
#SBATCH --account=<your_account>
#SBATCH --output=nvl72-container-ddp-%j.out
#SBATCH --error=nvl72-container-ddp-%j.err

CONTAINER_IMAGE=registry.example.org/gb200/nvl72-pytorch:1.0
PROJECT_ROOT="/project/<proj>"
DATA_ROOT="/datasets/<name>"

# SLURM_JOB_NODELIST is a compressed host list, so expand it and use the first host for rendezvous.
MASTER_ADDR=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)

# One torchrun launcher per node; torchrun starts the per-GPU workers.
# MASTER_ADDR expands on the host; the escaped variables expand inside the container.
srun \
  --container-image="${CONTAINER_IMAGE}" \
  --container-mounts="${PROJECT_ROOT}:/workspace,${DATA_ROOT}:/data" \
  --container-remap-root \
  bash -lc "
    cd /workspace;
    torchrun \
      --nnodes=\${SLURM_JOB_NUM_NODES} \
      --nproc_per_node=\${SLURM_GPUS_PER_NODE} \
      --rdzv_id=\${SLURM_JOB_ID} \
      --rdzv_backend=c10d \
      --rdzv_endpoint=${MASTER_ADDR}:29500 \
      train_distributed.py --config configs/nvl72_large.yaml --data-dir /data --log-dir /workspace/logs/nvl72
  "

Container workflow: build - test - scale

Build or choose a GB200-compatible image.
Test on a single node with 4 GPUs.
Scale to a small multi-node run.
Scale to production.
Freeze the environment. Reuse the same validated image tag for production.
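
The test and scale steps above can be sketched as a series of submissions that differ only in size while the image tag stays fixed. The stage sizes, time limits, and job.sh are illustrative assumptions, and the helper only prints the commands so they can be reviewed first.

```shell
#!/bin/bash
# Image tag to freeze across all stages; job.sh is a hypothetical batch script
# that reads IMAGE_TAG from its environment.
IMAGE_TAG="${IMAGE_TAG:-1.0}"

# stage_cmd NODES SEGMENT TIME  -> prints the sbatch command for one stage
stage_cmd() {
    echo "sbatch --partition=beta --nodes=$1 --gpus-per-node=4 --segment=$2 --time=$3 --export=ALL,IMAGE_TAG=${IMAGE_TAG} job.sh"
}

stage_cmd 1 1 00:30:00    # test on a single node with 4 GPUs
stage_cmd 2 1 01:00:00    # small multi-node run
stage_cmd 8 8 12:00:00    # production scale, same image tag
```

Keeping IMAGE_TAG in one place means every stage runs against the same validated environment.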

Best practices on NVL72

Prefer containers: use Pyxis+Enroot as the default for NVL72.
Use non-container jobs sparingly: limit bare-metal jobs to quick checks.
Test small, then scale: validate workflows on 1 node / 4 GPUs before scaling.
Use segment intentionally: increase segment size only when the workload benefits from stronger locality.
Checkpoint regularly: long-running distributed jobs should save state frequently.
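
In addition to periodic checkpoints, Slurm can warn a job shortly before its time limit so that it saves state one last time. The sketch below uses sbatch's --signal option and a shell trap. The flag-file convention and the --checkpoint-flag argument are hypothetical, so adapt them to whatever checkpoint mechanism your training code provides.

```shell
#!/bin/bash
# Ask Slurm to send SIGUSR1 to the batch shell about 300 s before the time limit.
#SBATCH --signal=B:USR1@300

# Hypothetical convention: the training loop polls for this file and saves a
# checkpoint when it appears.
CKPT_FLAG="${CKPT_FLAG:-${TMPDIR:-/tmp}/ckpt-requested-${SLURM_JOB_ID:-local}}"

request_checkpoint() {
    touch "${CKPT_FLAG}"
    echo "checkpoint requested: ${CKPT_FLAG}"
}
trap request_checkpoint USR1

# In a real job, launch the workload in the background so the shell can run the
# trap while the step is still active:
#   srun ... train_distributed.py --checkpoint-flag "${CKPT_FLAG}" &
#   wait    # returns early when the trap fires
#   wait    # wait again for the step to finish
```

Running the step in the background matters because bash defers a trap until the current foreground command returns.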