Empire AI Getting Started (Alpha, Grace, Beta)

Created by Cesar Arias, Modified on Fri, 22 May at 2:27 AM by Cesar Arias

Start Here

Empire AI Systems Guide — Choosing the Right System and Node Type

This guide is the starting point for choosing the right Empire AI environment. It gives a hardware-focused view of Alpha, Grace, and Beta so you can understand what is available, what each environment is best for, and how to think about node types, GPU types, CPU architecture, and Slurm targeting.

High-level picture: Alpha is the current H100/H200 GPU environment, Grace provides CPU-only ARM nodes, and Beta NVL72 B200 resources are intended for the largest AI training and inference workloads.

Scheduler layout: Alpha and Grace share a common Slurm environment. The NVL72 B200 Beta nodes are in a different Slurm environment, so Beta jobs are handled separately from the shared Alpha and Grace scheduler.

Choosing the right system and node type

Alpha and Beta are designed to complement each other rather than replace one another. Within Alpha, you can choose between H100 and H200 nodes, and for the largest or most demanding workloads you can move to the separate Beta NVL72 B200 environment. Grace nodes fit alongside Alpha for CPU-heavy and ARM64 workflows.

Alpha H100 nodes (alphagpu01-alphagpu18) are a solid default for many training runs and experiments that fit within 80 GB of GPU memory per device. They are often a good starting point for model development, tuning, and smaller production workloads.
Alpha H200 nodes (alphagpu19-alphagpu24) provide 141 GB of GPU memory per device and are better suited to larger models or batch sizes, as well as training runs where you want to reduce activation checkpointing or host-to-device transfers.
Grace ARM nodes (betagg01-betagg60) are a good fit for pre- and post-processing, data preparation, software validation, and HPC-style CPU workloads that sit next to Alpha GPU workflows without consuming GPU time.
Beta NVL72 B200 nodes are intended for very large models and high-throughput inference at scale. If a workload already saturates Alpha or requires the newest Blackwell-generation AI platform, Beta is likely the better target.

In practice, researchers start on Alpha H100 nodes, move to H200 nodes when they need more memory headroom, place CPU-heavy work on Grace, and then transition the heaviest training and inference workloads to Beta once they are ready to scale out.
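The progression above can be sketched as a small helper that suggests a starting tier from the per-GPU memory a job needs. This is only an illustration: the function name `suggest_tier` and the tier labels are hypothetical, and the thresholds are the per-device memory figures quoted in this guide (80 GB for H100, 141 GB for H200). Real sizing also depends on whether the model can be sharded across GPUs.

```shell
#!/bin/bash
# Hypothetical helper: suggest a starting tier from the GPU memory (in GB)
# that a job needs on each device. Thresholds come from this guide.
suggest_tier() {
    local need_gb="$1"
    if [ "$need_gb" -le 80 ]; then
        echo "alpha-h100"
    elif [ "$need_gb" -le 141 ]; then
        echo "alpha-h200"
    else
        # Beyond one H200: shard across GPUs or consider Beta.
        echo "shard-or-beta"
    fi
}

suggest_tier 60
suggest_tier 120
suggest_tier 300
```

Whole-number GB values are assumed, since the comparison uses shell integer arithmetic.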

Recognizing node types in Slurm

Alpha and Grace share a common Slurm environment, so you select between those resources through partitions and the node ranges documented for each system. The current partition name for Grace is grace, while Alpha currently uses institutional partitions and is expected to transition to an alpha hardware-tier partition.

Alpha H100 nodes — currently accessed through your institutional Alpha partition, using the Alpha H100 node range alphagpu01-alphagpu18.
Alpha H200 nodes — currently accessed through your institutional Alpha partition, using the Alpha H200 node range alphagpu19-alphagpu24.
Grace nodes — accessed through the grace partition, using the Grace node range betagg01-betagg60 for CPU-only ARM64 work.

Beta NVL72 B200 nodes are handled through a separate Slurm environment rather than the shared Alpha and Grace scheduler. To see the current Alpha and Grace configuration, you can run commands such as sinfo to list partitions and node states.
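As a rough sketch of how you might inspect the shared Alpha and Grace scheduler, the script below lists each node with its partition and state, then labels it with the node type from the ranges in this guide. The `node_type` helper is hypothetical and hard-codes those ranges, so it would need updating if the ranges change. The `sinfo` call is skipped when the command is not available, for example off-cluster.

```shell
#!/bin/bash
# Hypothetical helper: map a node name to the node type documented in this guide.
node_type() {
    local name="$1" num
    case "$name" in
        alphagpu[0-9][0-9])
            num=$((10#${name#alphagpu}))   # 10# forces base 10 so "08" is not read as octal
            if [ "$num" -ge 1 ] && [ "$num" -le 18 ]; then echo "H100"
            elif [ "$num" -ge 19 ] && [ "$num" -le 24 ]; then echo "H200"
            else echo "unknown"; fi ;;
        betagg[0-9][0-9])
            num=$((10#${name#betagg}))
            if [ "$num" -ge 1 ] && [ "$num" -le 60 ]; then echo "Grace"
            else echo "unknown"; fi ;;
        alphacpu01) echo "x86_64-CPU" ;;
        *) echo "unknown" ;;
    esac
}

if command -v sinfo >/dev/null 2>&1; then
    # -N: one line per node, -h: no header, -o: node name, partition, state
    sinfo -N -h -o "%N %P %T" | while read -r node part state; do
        echo "$node $part $state $(node_type "$node")"
    done
else
    echo "sinfo not found; run this on a cluster login node" >&2
fi
```

For a quick look at one partition, `sinfo -p grace` limits the listing to the Grace nodes.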

Alpha hardware

Alpha is the general GPU environment for H100/H200-class workflows, interactive GPU debugging, model training, inference, and batch experimentation. The current GPU pool is 192 GPUs across 24 GPU nodes. Each node has 8 GPUs, 10 ConnectX-7 400Gb/s NICs, 30TB of NVMe caching space, and 2TB of system memory.
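The pool size follows directly from the node ranges in this guide. The short calculation below is a sanity check of that arithmetic, and also derives the aggregate GPU memory per node type from the per-device figures (80 GB for H100, 141 GB for H200). The variable names are illustrative only.

```shell
#!/bin/bash
h100_nodes=18     # alphagpu01-alphagpu18
h200_nodes=6      # alphagpu19-alphagpu24
gpus_per_node=8

h100_gpus=$((h100_nodes * gpus_per_node))
h200_gpus=$((h200_nodes * gpus_per_node))
total_gpus=$((h100_gpus + h200_gpus))

# Aggregate GPU memory in GB, using the per-device sizes quoted in this guide.
h100_mem_gb=$((h100_gpus * 80))
h200_mem_gb=$((h200_gpus * 141))

echo "H100 GPUs: $h100_gpus, H200 GPUs: $h200_gpus, total: $total_gpus"
echo "Aggregate GPU memory (GB): H100 $h100_mem_gb, H200 $h200_mem_gb"
```

This gives 144 H100 GPUs and 48 H200 GPUs, which together match the stated total of 192.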

Node type	GPU / CPU resource	Partition	Notes
Alpha H100 nodes (`alphagpu01-alphagpu18`)	8× H100 80GB GPUs per node	Institutional Alpha partition for now, transitioning to `alpha`	Default starting point for many experiments and medium-scale training jobs
Alpha H200 nodes (`alphagpu19-alphagpu24`)	8× H200 GPUs per node, 141GB per GPU	Institutional Alpha partition for now, transitioning to `alpha`	Best when you need more GPU memory headroom or are staging larger jobs before Beta
`alphacpu01`	x86_64 CPU-only resource	`cpu`	Convenience CPU-only node for setup, lightweight work, and debugging

Grace hardware

Grace is the CPU-only ARM64 environment that shares the Alpha Slurm scheduler. It is intended for larger CPU-only jobs, preprocessing, data wrangling, memory-heavy workflows, and software validation on ARM64.

Node type	GPU / CPU resource	Partition	Notes
Grace ARM CPU nodes (`betagg01-betagg60`)	ARM64 / aarch64 CPU-only resources	`grace`	Good fit for preprocessing, feature extraction, simulation workloads, and CPU-heavy tasks that feed GPU jobs

Environment note: Grace is not just Alpha without GPUs. It is a different CPU architecture, so Python environments, compiled binaries, and some packages may need to be rebuilt. The multi-architecture guidance article covers practical guidance for moving software between x86_64 and ARM64 environments.
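One way to keep x86_64 and ARM64 software apart is to select an environment by the architecture that `uname -m` reports at job start. The sketch below assumes a hypothetical layout with one virtual environment per architecture under `$HOME/envs`. The directory names and the `env_dir_for_arch` helper are illustrative, not something Empire AI provides.

```shell
#!/bin/bash
# Hypothetical layout: one environment per CPU architecture under $HOME/envs.
env_dir_for_arch() {
    # Use the given architecture, or detect the current one with uname -m.
    local arch="${1:-$(uname -m)}"
    case "$arch" in
        x86_64)        echo "$HOME/envs/myproject-x86_64" ;;
        aarch64|arm64) echo "$HOME/envs/myproject-aarch64" ;;
        *)             echo "unsupported architecture: $arch" >&2; return 1 ;;
    esac
}

if env_dir="$(env_dir_for_arch)"; then
    echo "Would activate: $env_dir/bin/activate"
fi
```

The same job script can then run on Alpha or Grace nodes and pick up binaries built for the node it lands on.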

Beta hardware

Beta is the newer Blackwell-generation AI environment built around NVL72 B200 resources. It is intended for the largest-scale AI training and high-throughput inference workloads and uses a separate Slurm environment from Alpha and Grace.

Node type	GPU / CPU resource	Scheduler environment	Notes
Beta NVL72 B200 GPU nodes	B200 GPU resources for large-scale AI work	Separate Beta Slurm environment	Best fit for the largest training runs and highest-throughput inference services

Example job submissions

The examples below show how you might target different Alpha and Grace node types from an sbatch script.

Example: job on a single Alpha H100 node

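#!/bin/bash
#SBATCH --job-name=alpha-h100-test
#SBATCH --partition=<institutional_alpha_partition>
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
# Assumption: keep the job off the H200 nodes by excluding their range.
# Check whether your partition offers a GPU-type constraint instead.
#SBATCH --exclude=alphagpu[19-24]
#SBATCH --time=04:00:00
#SBATCH --account=<your_account>

module load <modules-you-need>
srun python train.py --model <...>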

Example: job on a single Alpha H200 node

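#!/bin/bash
#SBATCH --job-name=alpha-h200-large
#SBATCH --partition=<institutional_alpha_partition>
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
# Assumption: keep the job off the H100 nodes by excluding their range.
# Check whether your partition offers a GPU-type constraint instead.
#SBATCH --exclude=alphagpu[01-18]
#SBATCH --time=08:00:00
#SBATCH --account=<your_account>

module load <modules-you-need>
srun python train.py --model <...> --batch-size <larger_value>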

Example: data preprocessing on Grace nodes

#!/bin/bash
#SBATCH --job-name=grace-prep
#SBATCH --partition=grace
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=02:00:00
#SBATCH --account=<your_account>

module load <modules-you-need>
srun python preprocess_data.py --input <...> --output <...>
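Because Alpha and Grace share one scheduler, a Grace preprocessing job can feed an Alpha training job through a Slurm job dependency. The sketch below shows one possible way to chain them with `sbatch --parsable` and `--dependency=afterok`. The script names `grace_prep.sbatch` and `alpha_train.sbatch` and the `submit` wrapper are hypothetical. By default it runs in a dry-run mode that only prints the commands with a placeholder job ID.

```shell
#!/bin/bash
# Hypothetical script names: grace_prep.sbatch and alpha_train.sbatch.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to submit.
DRY_RUN="${DRY_RUN:-1}"

submit() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "sbatch --parsable $*" >&2   # show the command that would run
        echo "12345"                      # placeholder job ID for the dry run
    else
        sbatch --parsable "$@"            # prints the job ID on stdout
    fi
}

prep_id="$(submit grace_prep.sbatch)"
prep_id="${prep_id%%;*}"   # --parsable may append ";cluster"; keep only the ID

# Start training only if preprocessing finishes successfully.
train_id="$(submit --dependency=afterok:"$prep_id" alpha_train.sbatch)"
train_id="${train_id%%;*}"

echo "prep job: $prep_id, train job: $train_id"
```

With `afterok`, the training job stays pending until the preprocessing job exits with status 0, so GPU time is not consumed while the data is still being prepared.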

Partition and system mapping

The current scheduler setup is easiest to understand as a system-and-partition map. Alpha H100 and H200 nodes currently sit behind institutional Alpha partitions, Grace uses its own named partition, and Beta is handled through a separate scheduler environment.

System	Node type	GPU / CPU resource	Partition or scheduler mapping
Alpha	Alpha H100 nodes (`alphagpu01-alphagpu18`)	8× H100 80GB GPUs per node	Institutional Alpha partition for now, transitioning to `alpha`
Alpha	Alpha H200 nodes (`alphagpu19-alphagpu24`)	8× H200 GPUs per node, 141GB per GPU	Institutional Alpha partition for now, transitioning to `alpha`
Grace	Grace ARM CPU nodes (`betagg01-betagg60`)	CPU-only ARM64 / aarch64 resources	`grace`
Beta	Beta NVL72 B200 GPU nodes	B200 GPU resources	Separate Beta Slurm environment

How to think about system choice

If your workflow needs...	Use...	Why
Established H100 training or interactive GPU debugging	Alpha H100 nodes	They are the default starting point for experiments and many medium-scale jobs
More GPU memory headroom	Alpha H200 nodes	They provide 141GB per GPU and better support larger models or larger batch sizes
Quick shell access without a GPU	`alphacpu01` in `cpu`	Fast setup and debugging without consuming a GPU allocation
CPU-heavy preprocessing or ARM64 validation	Grace	Grace gives you CPU-only ARM64 resources in the same scheduler ecosystem as Alpha
Newest Blackwell-era AI hardware	Beta	Beta is the separate next-generation platform for the largest and most demanding workloads
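For the quick-shell case in the table, an interactive session is usually requested with `srun --pty`. The helper below only prints a candidate command rather than running it, so you can review it first. The function name `interactive_cmd` and the account value are placeholders, and whether interactive jobs are permitted on a given partition is an assumption to confirm for your allocation.

```shell
#!/bin/bash
# Print (not run) an srun command for an interactive shell on a partition.
# Arguments: partition, account, time limit in minutes (default 60).
interactive_cmd() {
    local partition="$1" account="$2" minutes="${3:-60}"
    echo "srun --partition=$partition --account=$account --time=$minutes --pty bash"
}

# Example: a 30-minute shell on the CPU-only convenience node's partition.
interactive_cmd cpu my_account 30
```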