Alpha Cluster Update: Power Maintenance (3/10 – 3/11). **COMPLETED**

Posted about 1 month ago by Cesar Arias

Cesar Arias Admin

:rotating_light: Update: Unexpected delays in the power work have extended the time required to bring the Alpha cluster back online.
The team is actively working on system restoration and validation. We will share another update once the cluster is fully available.


Scheduled Maintenance: Alpha Storage Subsystem
Nature of Work: Power event affecting Alpha storage infrastructure.
Impact Level: Storage Offline; Compute nodes remain in standby.

Maintenance Schedule

  • Starts: Tuesday, March 10 @ 9:00 AM
  • Ends: Wednesday, March 11 @ 9:00 AM

Job Management
To ensure a smooth transition and protect your active research, we are utilizing Slurm’s automated scheduling tools:

  • Reservations: A system-wide reservation will be in place for the duration of the event.
  • Job Draining: We will set nodes to "draining" for long-running tasks. If you submit a job that requests more time than remains before the maintenance start (9:00 AM on 3/10), the scheduler will hold it in a pending state.
  • Resume Policy: Once the storage subsystems are verified, all pending and held jobs will automatically resume their position in the queue.
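
To check ahead of time whether a job will start or be held, you can compare its requested time limit against the time remaining before the window. The sketch below is only an illustration, not an official tool: the function names are hypothetical, and the year in the usage example is an assumption because the announcement gives only the month and day.

```python
from datetime import datetime, timedelta


def max_walltime(now: datetime, window_start: datetime) -> timedelta:
    """Longest time limit that still ends before the maintenance window opens."""
    return max(window_start - now, timedelta(0))


def fits_before_window(now: datetime, window_start: datetime,
                       walltime: timedelta) -> bool:
    """True if a job starting right now would finish before the window."""
    return now + walltime <= window_start


def slurm_time(delta: timedelta) -> str:
    """Format a duration as days-hours:minutes:seconds for a Slurm time limit."""
    total = int(delta.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # ASSUMPTION: the year is not stated in the announcement; 2026 is used
    # here only so the example has concrete dates.
    window_start = datetime(2026, 3, 10, 9, 0)
    now = datetime(2026, 3, 9, 15, 30)

    print(slurm_time(max_walltime(now, window_start)))
    print(fits_before_window(now, window_start, timedelta(hours=12)))
    print(fits_before_window(now, window_start, timedelta(hours=24)))
```

A job that requests no more than the printed limit (for example via `sbatch --time=...`) can still be scheduled before the window, provided it starts immediately; a job that requests more stays pending until the reservation ends.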

User Impact
While the compute nodes will remain powered on, the storage backend will be inaccessible. Please ensure all interactive work is saved and closed before 9:00 AM on Tuesday 3/10.
