Empire AI Systems Guide — Choosing the Right System and Node Type
This guide is the starting point for choosing the right Empire AI environment. It gives a hardware-focused view of Alpha, Grace, and Beta so you can understand what is available, what each environment is best for, and how to think about node types, GPU types, CPU architecture, and Slurm targeting.
Choosing the right system and node type
Alpha and Beta are designed to complement each other rather than replace one another. Within Alpha, you can choose between H100 and H200 nodes, and for the largest or most demanding workloads you can move to the separate Beta NVL72 B200 environment. Grace nodes fit alongside Alpha for CPU-heavy and ARM64 workflows.
- Alpha H100 nodes (alphagpu01-alphagpu18) are a solid default for many training runs and experiments that fit within 80 GB of GPU memory per device. They are often a good starting point for model development, tuning, and smaller production workloads.
- Alpha H200 nodes (alphagpu19-alphagpu24) provide 141 GB of GPU memory per device and are better suited to larger models or batch sizes, as well as training runs where you want to reduce activation checkpointing or host-to-device transfers.
- Grace ARM nodes (betagg01-betagg60) are a good fit for pre- and post-processing, data preparation, software validation, and HPC-style CPU workloads that sit next to Alpha GPU workflows without consuming GPU time.
- Beta NVL72 B200 nodes are intended for very large models and high-throughput inference at scale. If a workload already saturates Alpha or requires the newest Blackwell-generation AI platform, Beta is likely the better target.
In practice, a common progression is to start on Alpha H100 nodes, move to H200 nodes when more memory headroom is needed, place CPU-heavy work on Grace, and transition the heaviest training and inference workloads to Beta once they are ready to scale out.
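As a rough illustration of that progression, the sketch below maps a per-GPU memory requirement to a starting node type. It assumes the 80 GB and 141 GB capacities quoted above are the only deciding factor, which is a simplification: real jobs need headroom for framework overhead, and a model that exceeds one GPU can often be sharded across several. The function name `suggest_gpu_target` and its output labels are hypothetical and not part of any Empire AI tooling.

```shell
# Rule-of-thumb helper (hypothetical): map a per-GPU memory need, in whole GB,
# to a starting node type using the capacities quoted in this guide.
suggest_gpu_target() {
  need_gb=$1
  if [ "$need_gb" -le 80 ]; then
    echo "alpha-h100"      # fits within an 80 GB H100
  elif [ "$need_gb" -le 141 ]; then
    echo "alpha-h200"      # needs the 141 GB H200
  else
    echo "beta-b200"       # beyond a single Alpha GPU; consider Beta
  fi
}

suggest_gpu_target 60
suggest_gpu_target 120
suggest_gpu_target 200
```

Treat the result as a first guess to refine with real memory profiling, not as an allocation policy.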
Recognizing node types in Slurm
Alpha and Grace share a common Slurm environment, so you select between those resources through partitions and the node ranges documented for each system. The current partition name for Grace is grace, while Alpha currently uses institutional partitions and is expected to transition to an alpha hardware-tier partition.
- Alpha H100 nodes: currently accessed through your institutional Alpha partition, using the Alpha H100 node range alphagpu01-alphagpu18.
- Alpha H200 nodes: currently accessed through your institutional Alpha partition, using the Alpha H200 node range alphagpu19-alphagpu24.
- Grace nodes: accessed through the grace partition, using the Grace node range betagg01-betagg60 for CPU-only ARM64 work.
Beta NVL72 B200 nodes are handled through a separate Slurm environment rather than the shared Alpha and Grace scheduler. To see the current Alpha and Grace configuration, you can run commands such as sinfo to list partitions and node states.
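To make the node ranges easier to recognize in scheduler output, here is a minimal sketch that labels a hostname by the ranges documented in this guide. The function name `classify_node` and its labels are hypothetical. The `sinfo` call uses standard options (`-N` for one line per node, `-h` to suppress the header, `-o "%N"` to print node names) and is guarded so the snippet does nothing on a machine without Slurm.

```shell
# Label a hostname using the node ranges documented in this guide.
# classify_node and its output labels are illustrative, not site tooling.
classify_node() {
  case "$1" in
    alphagpu[0-9][0-9])
      n=${1#alphagpu}; n=${n#0}          # strip prefix and one leading zero
      if [ "$n" -ge 1 ] && [ "$n" -le 18 ]; then echo "alpha-h100"
      elif [ "$n" -ge 19 ] && [ "$n" -le 24 ]; then echo "alpha-h200"
      else echo "unknown"; fi ;;
    betagg[0-9][0-9])
      n=${1#betagg}; n=${n#0}
      if [ "$n" -ge 1 ] && [ "$n" -le 60 ]; then echo "grace-arm64"
      else echo "unknown"; fi ;;
    alphacpu01) echo "alpha-cpu" ;;
    *) echo "unknown" ;;
  esac
}

# On a login node with Slurm available, label every node that sinfo reports.
if command -v sinfo >/dev/null 2>&1; then
  sinfo -N -h -o "%N" | sort -u | while read -r node; do
    printf '%s\t%s\n' "$node" "$(classify_node "$node")"
  done
fi
```

If the site later publishes GPU features or a dedicated alpha partition, those are a more robust way to identify node types than hostname matching.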
Alpha hardware
Alpha is the general GPU environment for H100/H200-class workflows, interactive GPU debugging, model training, inference, and batch experimentation. The current GPU pool is 192 GPUs across 24 GPU nodes. Each node has 8 GPUs with 10 ConnectX-7 400 Gb/s NICs, 30 TB of NVMe caching space, and 2 TB of system memory.
| Node type | GPU / CPU resource | Partition | Notes |
|---|---|---|---|
| Alpha H100 nodes (alphagpu01-alphagpu18) | 8× H100 80GB GPUs per node | Institutional Alpha partition for now, transitioning to alpha | Default starting point for many experiments and medium-scale training jobs |
| Alpha H200 nodes (alphagpu19-alphagpu24) | 8× H200 GPUs per node, 141GB per GPU | Institutional Alpha partition for now, transitioning to alpha | Best when you need more GPU memory headroom or are staging larger jobs before Beta |
| alphacpu01 | x86_64 CPU-only resource | cpu | Convenience CPU-only node for setup, lightweight work, and debugging |
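The pool figures can be cross-checked with a little arithmetic. The sketch below derives GPU counts and aggregate GPU memory from the node ranges and per-GPU capacities stated in this guide (18 H100 nodes, 6 H200 nodes, 8 GPUs each). The variable names are illustrative, and the memory totals are nominal capacities, not usable memory.

```shell
# Node counts follow from the documented ranges:
# alphagpu01-18 is 18 nodes, alphagpu19-24 is 6 nodes.
h100_nodes=18
h200_nodes=6
gpus_per_node=8

h100_gpus=$((h100_nodes * gpus_per_node))    # 144
h200_gpus=$((h200_nodes * gpus_per_node))    # 48
total_gpus=$((h100_gpus + h200_gpus))        # 192, matching the stated pool

h100_node_mem=$((gpus_per_node * 80))        # 640 GB of GPU memory per H100 node
h200_node_mem=$((gpus_per_node * 141))       # 1128 GB of GPU memory per H200 node
total_gpu_mem=$((h100_gpus * 80 + h200_gpus * 141))   # 18288 GB across Alpha

echo "Alpha GPUs: $total_gpus (H100: $h100_gpus, H200: $h200_gpus)"
echo "GPU memory per node: H100 ${h100_node_mem} GB, H200 ${h200_node_mem} GB"
echo "Aggregate GPU memory: ${total_gpu_mem} GB"
```

The per-node figures are a quick way to judge whether a model fits on a single node before planning a multi-node run.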
Grace hardware
Grace is the CPU-only ARM64 environment that shares the Alpha Slurm scheduler. It is intended for larger CPU-only jobs, preprocessing, data wrangling, memory-heavy workflows, and software validation on ARM64.
| Node type | GPU / CPU resource | Partition | Notes |
|---|---|---|---|
| Grace ARM CPU nodes (betagg01-betagg60) | ARM64 / aarch64 CPU-only resources | grace | Good fit for preprocessing, feature extraction, simulation workloads, and CPU-heavy tasks that feed GPU jobs |
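Because Grace is ARM64 while alphacpu01 is x86_64, binaries and Python environments built on one architecture generally will not run on the other. The sketch below shows one way a job script might detect the architecture and pick a matching environment. The function name `arch_family` and the `$HOME/envs/myproject-...` layout are assumptions made for illustration, not an Empire AI convention.

```shell
# Normalize a machine architecture string; defaults to this host's `uname -m`.
# arch_family is a hypothetical helper name.
arch_family() {
  case "${1:-$(uname -m)}" in
    aarch64|arm64) echo "arm64" ;;
    x86_64|amd64)  echo "x86_64" ;;
    *)             echo "other" ;;
  esac
}

# Hypothetical layout: one environment directory per architecture.
ENV_DIR="$HOME/envs/myproject-$(arch_family)"
echo "Would use environment: $ENV_DIR"
```

Keeping one environment per architecture lets the same job script run unchanged on Grace and on x86_64 nodes.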
Beta hardware
Beta is the newer Blackwell-generation AI environment built around NVL72 B200 resources. It is intended for the largest-scale AI training and high-throughput inference workloads and uses a separate Slurm environment from Alpha and Grace.
| Node type | GPU / CPU resource | Scheduler environment | Notes |
|---|---|---|---|
| Beta NVL72 B200 GPU nodes | B200 GPU resources for large-scale AI work | Separate Beta Slurm environment | Best fit for the largest training runs and highest-throughput inference services |
Example job submissions
The examples below show how you might target different Alpha and Grace node types from an sbatch script.
Example: job on a single Alpha H100 node
#!/bin/bash
#SBATCH --job-name=alpha-h100-test
#SBATCH --partition=<institutional_alpha_partition>
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00
#SBATCH --account=<your_account>
# Load the software modules the job needs
module load <modules-you-need>
# Launch the training script as a Slurm job step
srun python train.py --model <...>
Example: job on a single Alpha H200 node
#!/bin/bash
#SBATCH --job-name=alpha-h200-large
#SBATCH --partition=<institutional_alpha_partition>
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --time=08:00:00
#SBATCH --account=<your_account>
# Load the software modules the job needs
module load <modules-you-need>
# The larger H200 memory leaves room for a larger batch size
srun python train.py --model <...> --batch-size <larger_value>
Example: data preprocessing on Grace nodes
#!/bin/bash
#SBATCH --job-name=grace-prep
#SBATCH --partition=grace
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=02:00:00
#SBATCH --account=<your_account>
# Load ARM64 builds of the software modules the job needs
module load <modules-you-need>
# Run CPU-only preprocessing as a Slurm job step
srun python preprocess_data.py --input <...> --output <...>
Partition and system mapping
The current scheduler setup is easiest to understand as a system-and-partition map. Alpha H100 and H200 nodes currently sit behind institutional Alpha partitions, Grace uses its own named partition, and Beta is handled through a separate scheduler environment.
| System | Node type | GPU / CPU resource | Partition or scheduler mapping |
|---|---|---|---|
| Alpha | Alpha H100 nodes (alphagpu01-alphagpu18) | 8× H100 80GB GPUs per node | Institutional Alpha partition for now, transitioning to alpha |
| Alpha | Alpha H200 nodes (alphagpu19-alphagpu24) | 8× H200 GPUs per node, 141GB per GPU | Institutional Alpha partition for now, transitioning to alpha |
| Grace | Grace ARM CPU nodes (betagg01-betagg60) | CPU-only ARM64 / aarch64 resources | grace |
| Beta | Beta NVL72 B200 GPU nodes | B200 GPU resources | Separate Beta Slurm environment |
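Since H100 and H200 nodes currently share an institutional partition, the partition name alone does not select a GPU type. One possible approach, sketched below, is to exclude the unwanted node range with sbatch's standard `--exclude` option. This assumes your institutional partition spans both ranges and that the site has not published GPU features or constraints, which would be preferable if they exist. The function name `target_flags` and the `ALPHA_PARTITION` variable are hypothetical placeholders.

```shell
# Placeholder: replace with your institution's Alpha partition name.
ALPHA_PARTITION=${ALPHA_PARTITION:-institutional_alpha_partition}

# Print sbatch flags for a target node type, based on the mapping table above.
# Assumption: excluding the other Alpha range steers the job to the desired GPUs.
target_flags() {
  case "$1" in
    h100)  echo "--partition=$ALPHA_PARTITION --exclude=alphagpu[19-24]" ;;
    h200)  echo "--partition=$ALPHA_PARTITION --exclude=alphagpu[01-18]" ;;
    grace) echo "--partition=grace" ;;
    *)     echo "unknown target: $1" >&2; return 1 ;;
  esac
}

target_flags h100
target_flags grace

# Possible usage on a login node (set -f stops the shell from globbing the brackets):
#   (set -f; sbatch $(target_flags h200) train.sbatch)
```

Options given on the sbatch command line take precedence over `#SBATCH` directives in the script, so the same script can be retargeted without editing it. Confirm the approach against your own partition with sinfo before relying on it.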
How to think about system choice
| If your workflow needs... | Use... | Why |
|---|---|---|
| Established H100 training or interactive GPU debugging | Alpha H100 nodes | They are the default starting point for experiments and many medium-scale jobs |
| More GPU memory headroom | Alpha H200 nodes | They provide 141GB per GPU and better support larger models or larger batch sizes |
| Quick shell access without a GPU | alphacpu01 in cpu | Fast setup and debugging without consuming a GPU allocation |
| CPU-heavy preprocessing or ARM64 validation | Grace | Grace gives you CPU-only ARM64 resources in the same scheduler ecosystem as Alpha |
| Newest Blackwell-era AI hardware | Beta | Beta is the separate next-generation platform for the largest and most demanding workloads |