Service units and allocations for Alpha+Beta

Created by Robert Harrison, Modified on Fri, 1 Aug at 7:51 AM by Robert Harrison

[The service unit multipliers in this article are valid for CY2025, CY2026, and CY2027 until Gamma commences operations around June 2027. The values may change once Gamma is in operation.]

All computer resources are assigned a value in Service Units (SUs), a concept familiar to anyone who has used an NSF national facility. On Alpha+Beta these values will be as follows:

1 Alpha GPU hour = 1 SU
1 Beta GPU hour = 2 SU
1 Grace-Grace node hour = 0.5 SU
1 TB-month = 8.333 SU up to 100 TB; data over 100 TB will be free but allocated and managed separately (see below)
Priority queue access: 2x the above
Shared resource access: 0.5x the above

These conversion factors are termed multipliers. SUs are a necessary abstraction, both to unify accounting over disparate resources spanning multiple technology generations and because there are multiple ways of measuring cost (e.g., chargeback cost, actual delivery cost, equivalent cost in the cloud). The cost of a job is computed as

Cost in SUs = duration * amount of resources * multiplier

and the cost in SUs of storage is computed monthly based on average use. Each user will receive 100 GB of free storage for their home directory.
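The job cost formula above can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary keys, and the "standard" access mode (meaning no scaling) are assumptions made for this example and are not part of any Alpha+Beta tool. It also assumes that the priority and shared factors simply scale the base multiplier.

```python
# Base multipliers from the list above, in SUs per resource-hour.
RESOURCE_MULTIPLIERS = {
    "alpha_gpu": 1.0,
    "beta_gpu": 2.0,
    "grace_grace_node": 0.5,
}

# Access factors from the list above; "standard" (no scaling) is an assumed name.
ACCESS_FACTORS = {"standard": 1.0, "priority": 2.0, "shared": 0.5}


def job_cost_su(resource, hours, count, access="standard"):
    """Cost in SUs = duration * amount of resources * multiplier."""
    multiplier = RESOURCE_MULTIPLIERS[resource] * ACCESS_FACTORS[access]
    return hours * count * multiplier


# 4 Beta GPUs for 10 hours, first at the base rate and then in the priority queue.
print(job_cost_su("beta_gpu", hours=10, count=4))                     # 80.0
print(job_cost_su("beta_gpu", hours=10, count=4, access="priority"))  # 160.0
```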

With the above multipliers, α + β will deliver about 6.3M SUs per year, so each institutional allocation is 0.7M SUs per year. Institutions will give projects an allocation of SUs that can be spent as needed on any resource, giving users maximum flexibility and motivating responsible, efficient system utilization. Empire AI will track and report usage monthly, and usage information will also be available in near real time to users and managers via the command line and a dashboard.

The "available" SUs above, which institutions will allocate, include only about 10% of the actual available storage, in part because storage use is currently low and we anticipate that most projects will primarily seek compute time. This fraction might be adjusted as we gain experience with actual usage. Projects with large data needs will be managed separately, in part to encourage large projects that presumably have a large data footprint, and also because storage is a finite and fixed resource. Specifically,

data beyond 100 TB will be free, and
due to the finite nature of storage, large storage requests will be accommodated on an as-feasible basis, with fixed quotas for fixed periods of time scheduled with projects so that as many large projects as possible can be accommodated.

Projects needing a large allocation of data over an extended period will be managed on a case-by-case basis. If storage runs short, the above policies may be adjusted as needed. If a project discovers a need for more than 100 TB of storage during execution, please submit a ticket to request this access.
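As a rough sketch of how the storage rules combine, the monthly charge can be modeled as the average stored volume, capped at 100 TB, times the storage multiplier. The function and constant names are hypothetical, and the sketch ignores the free home-directory storage and any separately scheduled quota.

```python
SU_PER_TB_MONTH = 8.333  # storage multiplier from the list above
CHARGED_CAP_TB = 100.0   # data beyond this is free but allocated separately


def monthly_storage_su(average_tb):
    """SUs charged for one month, given that month's average usage in TB."""
    charged_tb = min(max(average_tb, 0.0), CHARGED_CAP_TB)
    return charged_tb * SU_PER_TB_MONTH


# 50 TB is charged in full; 1 PB (1000 TB) is charged as 100 TB.
print(round(monthly_storage_su(50), 1))
print(round(monthly_storage_su(1000), 1))
```

Under this model a project holding 100 TB or more for a full year is charged about 10K SUs for storage, however much data it holds above the cap.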

Institutions and projects will not be charged for more than their allocation

Projects and institutions that have exceeded allocations will only have access to free, low-priority, preemptible queues (that are otherwise inaccessible) and will be requested to reduce their storage footprint as described in the terms and conditions. This policy provides a smooth end to projects, helps ensure full system utilization, and encourages projects to promptly use allocations.
Users with no allocation will be transitioned off the system after some period of time, again as described in the terms and conditions.

Example “average” use scenarios on α + β

Assuming 35 projects per institutional allocation per year, each project will get about 20K SUs per year on average (0.7M SUs / 35).

Here are four scenarios showing how 20K SUs could be used:

Scenario 1: Computes only on H100 GPUs with no additional data

2.3 GPUs dedicated all year (total AWS estimated cost is about $80K)

Scenario 2: Stores 50 TB for a year and computes on H100 GPUs

50*12 TB-months costs 5K SUs
This leaves 15K SUs, equivalent to 1.7 H100 GPUs dedicated all year (total AWS estimated cost is over $100K)

Scenario 3: Stores 100 TB for a year and computes on H100 GPUs

Storage and compute costs are equal (10K SUs each), providing 1.14 dedicated GPUs (total AWS estimated cost is over $120K)
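The arithmetic behind Scenarios 1 to 3 can be reproduced with a short calculation. This sketch assumes a 365-day year and a rate of 1 SU per H100 GPU hour (the rate implied by the 2.3-GPU figure in Scenario 1); the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year
SU_PER_GPU_HOUR = 1.0      # assumed H100 rate, as implied by Scenario 1
SU_PER_TB_MONTH = 8.333    # storage multiplier; only the first 100 TB is charged


def dedicated_gpus(budget_su, stored_tb):
    """GPUs that can run all year after paying for a year of storage."""
    storage_su = min(stored_tb, 100) * 12 * SU_PER_TB_MONTH
    compute_su = budget_su - storage_su
    return compute_su / (HOURS_PER_YEAR * SU_PER_GPU_HOUR)


budget = 700_000 / 35  # average project allocation: 20K SUs per year
for tb in (0, 50, 100):
    print(tb, "TB ->", round(dedicated_gpus(budget, tb), 2), "GPUs")
# 0 TB -> 2.28 GPUs
# 50 TB -> 1.71 GPUs
# 100 TB -> 1.14 GPUs
```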

Large data projects:

1 PB is required for 3 months. A request is made either in the original proposal or via ticket. Only 10K SUs for the first 100 TB will be charged, but space above that capacity is scheduled for use for a fixed period of time.
(estimated cost of 1 PB of fast storage on AWS EBS for 1 year is about $1M)
Such projects will be staged on/off in a timely manner since storage is finite.
Presumably computing costs will be commensurately large.