The Empire AI Alpha Fairshare and Performance Model

Created by Cesar Arias, Modified on Fri, 27 Mar at 4:14 PM by Cesar Arias

1. Overview

The Empire AI Alpha cluster utilizes a Multi-Factor Priority scheduling system to manage resource allocation across ten participating institutions. This model is designed to ensure equitable access to high-performance H100 and H200 resources, prioritizing institutional fairness over a simple first-come, first-served approach.

2. Resource Alignment

The Alpha cluster uses resource alignment to match the physical topology of our 8-way HGX nodes. We have implemented a billing weight of 12.0 CPUs per GPU.


Technical Setup: For every 1 GPU requested, Slurm automatically reserves the 12 physical CPU cores and the memory bandwidth local to that GPU's NUMA domain.

Researcher Benefit: Your AI training runs cleanly and at a steady, predictable speed. On Alpha, your 12 cores belong exclusively to you, so your model's performance is protected from "noisy neighbor" interference from other users' data-processing tasks.

Impact on Fairshare: This ensures that GPU-intensive AI jobs and CPU-only preprocessing tasks are charged the same "priority cost" relative to the physical footprint they occupy. It prevents priority gaming, where a user might request a GPU with zero cores to artificially keep their usage low.
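In practice, this means a batch script only needs to ask for GPUs; the aligned cores come with them. The sketch below is illustrative (the script name and time limit are placeholders, and the exact directives Alpha expects may differ):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1          # request one GPU
#SBATCH --time=01:00:00       # placeholder walltime
# No --cpus-per-task needed: under a 12-CPU-per-GPU alignment,
# Slurm reserves the 12 cores local to the allocated GPU for you.

srun python train.py          # train.py is a placeholder
```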

3. Institutional Resource Ceilings

To protect the ten member institutions sharing the pool, a hard ceiling is enforced on every institutional account.


Technical Setup: Every university is capped at a maximum of 96 concurrent GPUs. With 192 production GPUs in the cluster, this limit acts as a capacity guardrail.

Researcher Benefit: You are never at the mercy of another university's project size. Even if a neighboring institution has 50 active researchers, they collectively cannot occupy more than 50% of the cluster. This ensures no single institution can monopolize the resource pool.

Impact on Fairshare: This ceiling prevents any single institution from accumulating so much usage that it permanently bottoms out the Fairshare rankings, keeping the cluster a collaborative environment rather than a winner-takes-all resource.
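You can check where your institution stands against a ceiling like this from the command line. The account name below is a placeholder, and the exact fields exposed on Alpha may differ:

```shell
# Fairshare standing for an institutional account (placeholder name "uni_a"):
sshare -A uni_a -o Account,User,RawUsage,FairShare

# The ceiling itself is typically a group TRES limit on the account,
# visible (to those with access) via the accounting database:
sacctmgr show assoc account=uni_a format=Account,GrpTRES
```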

4. Throughput Policies

Individual researcher throughput is governed by two limits: 30 concurrent jobs (MaxJobsPU) and 200 total jobs (MaxSubmitPU).


Technical Setup: You may have up to 30 jobs running simultaneously and a total of 200 jobs in the system (Running + Pending).

Researcher Benefit: These limits prevent the queue from becoming bloated with thousands of tiny, unmanaged jobs from a single user. A limit of 30 concurrent jobs and 200 total jobs allows a single researcher to run a diverse workload, from many small single-GPU experiments to a few large multi-node training runs, while providing the Backfill Scheduler with the necessary depth to pack small jobs into gaps around larger reservations.

Impact on Fairshare: These limits act as a priority buffer. They prevent a user from running so many jobs at once that their Fairshare score collapses to zero in a single day. This preserves your priority score, ensuring your jobs don't suddenly get stuck for weeks while waiting for usage to decay.
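A quick way to check your own position against these limits is to count your jobs by state. This assumes standard squeue flags and your normal login environment:

```shell
# Jobs currently running (compare against the 30-job MaxJobsPU limit):
squeue -u "$USER" -h -t RUNNING | wc -l

# All of your jobs in the system, running + pending
# (compare against the 200-job MaxSubmitPU limit):
squeue -u "$USER" -h | wc -l
```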

5. Multi-Factor Priority Scoring

Slurm calculates the priority of your pending jobs every hour to determine who starts next. These weights apply globally to every job across all partitions/institutional tiers.


Technical Setup: The system uses a Fairshare weight of 50,000, an Age weight of 10,000, and a Job Size weight of 5,000.

Researcher Benefit: The system moves away from a first-come, first-served model to an Equity-First model. If you haven't run a job recently, your priority score will naturally climb, allowing you to bypass users who have already been running at full capacity.

Impact on Fairshare: This is the heart of the equity engine. It ensures that GPUs are distributed according to the targeted shares assigned to each institution.
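As a rough sketch of how the weights combine: Slurm normalizes each factor (fairshare, age, job size) to the range 0.0 to 1.0, multiplies each by its weight, and sums the results. The function and variable names below are illustrative, not Slurm's own:

```python
# Illustrative sketch of Slurm's multifactor priority sum.
# Each factor is normalized by Slurm to [0.0, 1.0]; the weights
# are the ones quoted for Alpha above.
PRIORITY_WEIGHT_FAIRSHARE = 50_000
PRIORITY_WEIGHT_AGE = 10_000
PRIORITY_WEIGHT_JOBSIZE = 5_000

def job_priority(fairshare: float, age: float, job_size: float) -> int:
    """Combine normalized factors (0.0-1.0) into an integer priority."""
    return int(
        PRIORITY_WEIGHT_FAIRSHARE * fairshare
        + PRIORITY_WEIGHT_AGE * age
        + PRIORITY_WEIGHT_JOBSIZE * job_size
    )

# A user who has barely run recently (fairshare ~0.9) outranks a heavy
# user (fairshare ~0.1) even when the heavy user's job has aged longer:
light_user = job_priority(fairshare=0.9, age=0.2, job_size=0.5)
heavy_user = job_priority(fairshare=0.1, age=1.0, job_size=0.5)
print(light_user, heavy_user)
```

On a real cluster you can inspect the per-factor contributions for your own pending jobs with `sprio -u $USER`.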
