Production Release of the Afton HPC System: July 2, 2024

Our new supercomputer, “Afton,” is now available for general use. This is the first major expansion of RC’s computing resources since Rivanna’s last hardware refresh in 2019, and it more than doubles the High-Performance Computing (HPC) capacity available at UVA. Each of the 300 compute nodes in the new system has 96 compute cores, up from a maximum of 48 cores per node on Rivanna, and the higher core count is matched by a significant increase in memory per node: each Afton node provides at least 750GB of memory, with some supporting up to 1.5TB. The large amount of memory per node allows researchers to work efficiently with the ever-expanding datasets we are seeing across diverse research disciplines.

Maintenance: July 2, 2024

You may continue to submit jobs until the maintenance period begins, but if the system determines your job will not have time to finish, it will not start until the HPC systems are returned to service.
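To check whether a pending job is being held back by the maintenance window, Slurm's standard squeue command can report estimated start times; a minimal sketch (the reported times depend on the scheduler's current estimates):

# List your pending jobs together with their expected start times
squeue -u $USER --start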

The Rivanna and Afton production systems are expected to return to service by Wednesday, July 3 at 6 a.m.

Questions: Please contact our user services team, or join us for our virtual office hours every Tuesday, 3-5 p.m., and Thursday, 10 a.m.-12 p.m.

What to expect after the maintenance?

  • New hardware: On May 28, a total of 300 compute nodes, 96 cores each, based on the AMD EPYC 9454 architecture were added to UVA’s HPC environment as the new Afton system. The new Afton hardware provides additional capacity for serial, parallel, and GPU computing side-by-side with the existing Rivanna system.

  • Configuration: The hardware partition definitions will be reconfigured to make the most effective use of the new Afton and existing Rivanna systems. (The Weka scratch filesystem will be mounted in non-dedicated mode, meaning all cores on a node are available to jobs; previously, 3 cores per node were dedicated to Weka.)

  • Access: The Rivanna and Afton systems are accessible via the existing, shared Open OnDemand, FastX, and SSH access points.

  • Software, Code, and Job Submissions: The shared software stack and modules have been tested during the pre-release phase. In most cases, users can use the system without any changes to their job submission scripts; in some instances, they may need to update their Slurm job scripts or recompile their own code. The RC team is available to help with the transition.

  • Policy: A new charge rate policy will be implemented during Fall 2024 (tentative) to more closely reflect the actual hardware cost.

FAQ

Is the new Afton system replacing Rivanna?

No, the new Afton system exists side-by-side with the existing Rivanna system. Both systems are accessible through shared login nodes; see “How do I log in to the Afton system?”

What does the new Afton system add to UVA’s HPC environment?

On May 28, a total of 300 compute nodes with 96 cores each, based on the AMD EPYC 9454 architecture, were added to UVA’s HPC environment. The added nodes expand UVA’s HPC capabilities in the following areas:

  • A complete hardware refresh of the parallel partition with 96-core nodes, roughly doubling its capacity (based on aggregate CPU core count).

  • Expanded capacity of the standard partition for single node jobs and high-throughput computing with up to 96 cores and 1.5TB of memory per node.

  • Addition of new nodes with NVIDIA A40 general purpose graphics processing units (GPUs) to accommodate more ML/DL computing in the gpu partition.

How do I log in to the Afton system?

The login access points are shared for the Afton and Rivanna systems.

See here for details. You must be a member of an active HPC allocation before you can log in.

Can I still use the Rivanna system?

Yes, the login access points are shared for the Rivanna and Afton systems. We have added new hardware feature tags that allow you to specifically request Rivanna resources for your compute jobs once logged in.

See “What are the changes to the hardware partitions?” and “What are hardware features? What are the hardware feature defaults for each partition?”.

What are the changes to the hardware partitions?

The following partition changes are taking place on July 2:

  • The pre-release afton partition will be removed. The nodes will be placed in other partitions.
  • The nodes making up the parallel partition will be completely replaced with 200 Afton nodes. The original nodes will be placed into standard.
  • The largemem partition will be removed. All 750GB nodes will be placed in the standard partition.
  • All RTX3090 nodes from the gpu partition will be placed in the interactive partition.

New partition configuration:

| Partition | Rivanna Nodes | Afton Nodes | Use Cases |
|---|---|---|---|
| standard | yes | yes | For jobs on a single compute node, including those with large memory requirements. |
| parallel | no | yes | For large parallel multi-node jobs. |
| gpu | yes | yes | For jobs using general purpose graphical processing units, e.g. for machine learning/deep learning. |
| interactive | yes | yes | For quick interactive sessions, code development, and instructional use. It includes a small number of lower-end GPU nodes. |
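For instance, a quick session in the interactive partition can be requested with Slurm's standard salloc command; an illustrative sketch, in which the allocation name, core count, memory, and time limit are placeholder values:

# Request a short interactive session on the interactive partition
salloc -A <allocation_name> -p interactive -c 4 --mem=16G -t 1:00:00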

What happened to the largemem partition?

Nodes of the largemem partition have been moved to the standard partition. See “What are the changes to the hardware partitions?”

What happened to the dev and instructional partitions?

The dev and instructional partitions were merged and replaced with a single interactive partition during the Afton pre-release on May 30.

What are hardware features? What are the hardware feature defaults for each partition?

Feature constraints and generic resources (GRES) allow you to request specific hardware within a given partition. Through feature constraints you can specify whether a job should be scheduled on the new Afton hardware or the older Rivanna system.

Feature constraints are optional; you may submit jobs without feature constraints. If no feature constraint is specified, the Slurm scheduler will place your job on available partition hardware following a default priority order.

Note: Not all features are available in every partition. The table below lists the available features for each partition, including the default behavior if no feature is specified.

| Partition | Available Feature Constraints | GRES | Default Priority Order | Notes |
|---|---|---|---|---|
| standard | afton, rivanna | None | rivanna > afton | If no constraint is specified, the scheduler will attempt to place the job on Rivanna hardware first, with Afton hardware as the second alternative. |
| parallel | None | None | n/a | The entire partition is configured with new Afton nodes. No feature constraint is required. |
| gpu | None | v100, a40, a6000, a100_40gb, a100_80gb | v100 > a6000 > a40 > a100_40gb > a100_80gb | If no GRES request is specified, the scheduler will attempt to place the job on a V100 node first, with the A100 80GB nodes (i.e. the BasePOD) as the last alternative. The A40 nodes were purchased along with the new Afton hardware. |
| interactive | afton, rivanna | rtx2080, rtx3090 | rivanna > afton | If no constraint is specified, the scheduler will attempt to place the job on Rivanna hardware first, with Afton hardware as the second alternative. |
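As an illustration of a GRES request, a job that specifically needs one of the new A40 GPUs (rather than whatever the default priority order selects) could include the following lines; the GPU type names are those listed in the GRES column above, and the device count of 1 is a placeholder:

#SBATCH -p gpu
#SBATCH --gres=gpu:a40:1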

See “How do I use Afton for my Slurm job? Do I need to update my job scripts?” and “How can I use the new Afton hardware in Open OnDemand?” for instructions on using these feature constraints in your job submission scripts or Open OnDemand.

How do I use Afton for my Slurm job? Do I need to update my job scripts?

Most users should be able to submit jobs without changing their Slurm job scripts. Updates may be needed in the following cases:

  • An invalid request due to partition changes (see “What are the changes to the hardware partitions?”)

    • Example: A job submitted to largemem will become invalid since that partition has been removed. Submit to standard instead and specify the memory with --mem=... (up to 1462G); see the sketch after this list.
  • Cost considerations (see “How is compute time charged on the Rivanna and Afton systems?”)

    • Example: Instead of running a light GPU job on an A100 in gpu, request an RTX2080 or RTX3090 in interactive via a --gres=gpu option (see the sketch after this list).
  • A need for specific Rivanna vs. Afton hardware for performance, reproducibility, or benchmarking reasons (only relevant for standard and interactive)

    • Example: To restrict a standard job to run on the new Afton hardware, provide a constraint (--constraint= or -C):
#SBATCH -p standard
#SBATCH --constraint=afton

and likewise for Rivanna hardware:

#SBATCH -p standard
#SBATCH --constraint=rivanna
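For the first two cases above, here are minimal illustrative job script headers; the memory value, GPU type, and device count are placeholder choices, not recommendations. A former largemem job resubmitted to standard:

#SBATCH -p standard
#SBATCH --mem=1000G

A light GPU job moved from gpu to interactive, requesting a lower-end card explicitly:

#SBATCH -p interactive
#SBATCH --gres=gpu:rtx3090:1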

How can I use the new Afton hardware in Open OnDemand?

When setting up an Interactive App session in Open OnDemand, you may enter the --constraint=afton or --constraint=rivanna feature constraint in the “Optional: Slurm Option (Reservation, Constraint)” field.

Available feature constraints are listed here: “What are hardware features? What are the hardware feature defaults for each partition?”.

Do I need to recompile my code for the Afton hardware?

If you have already done this for the Afton pre-release testing, then no. Otherwise, please use the following flowchart.

  • Which compiler did you use to build your code?

    • Not Intel (e.g. GCC, NVIDIA nvhpc) → no
    • Intel → continue
  • Do you intend to run your code on Afton hardware? (Please note the parallel partition will be completely replaced by Afton hardware.)

    • No → no
    • Yes → continue
  • Did you use the -x flag (e.g. -xavx)?

    • No → no
    • Yes → yes, rebuild with -march=skylake-avx512 instead of -x... (see the example below)
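For example, a build command that previously used an Intel-specific -x flag might be adjusted as follows; this is an illustrative sketch, and the compiler invocation, optimization level, and file names are placeholders rather than a prescribed build line:

# Before: -xavx generates code dispatched only on genuine Intel CPUs
icc -O3 -xavx -o mycode mycode.c

# After: -march=skylake-avx512 omits the Intel-only CPU check, so the AVX-512
# code path can also run on the Afton AMD EPYC nodes
icc -O3 -march=skylake-avx512 -o mycode mycode.c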

How is compute time charged on the Rivanna and Afton systems?

Starting Fall 2024 (tentative), a new service unit (SU) charge rate policy will be implemented to more closely reflect the actual hardware cost. For all non-GPU jobs, the SU charge rate will be based on the number and type of CPU cores (Intel on Rivanna, AMD on Afton) plus the amount of memory allocated. For GPU jobs (in gpu and interactive), the SU charge rate will be based on the number and type of GPU devices allocated.

| Partition | Hardware | Charge per core | Charge per GB memory | Charge per GPU device |
|---|---|---|---|---|
| standard | Rivanna |  |  | - |
| standard | Afton |  |  | - |
| parallel | Afton |  |  | - |
| interactive (non-GPU) | Rivanna |  |  | - |
| interactive (non-GPU) | Afton |  |  | - |
| interactive (GPU) | RTX2080 | - | - |  |
| interactive (GPU) | RTX3090 | - | - |  |
| gpu | V100 | - | - |  |
| gpu | A6000 | - | - |  |
| gpu | A40 | - | - |  |
| gpu | A100 (40G) | - | - |  |
| gpu | A100 (80G) | - | - |  |

Starting Fall 2024 (tentative), a new charge rate policy will be implemented to more closely reflect the actual hardware cost. For all non-GPU jobs, the charge rate will be based on the number of CPU cores and the amount of memory allocated. For GPU jobs (in gpu and interactive), the charge rate will be based on the number of GPU devices allocated.

Use of Afton hardware may allow jobs to complete faster but may consume more SUs overall due to a higher burn rate.

To ensure fair access to the HPC environment for all research groups, we utilize Slurm’s job accounting and fairshare system. This system influences job placement priority, with a higher fairshare value typically resulting in a higher queue priority. However, the fairshare value decreases as more service units are consumed.

Crucially, fairshare values are linked to the Principal Investigator (PI) of the allocation being utilized. This connection prevents any single group from dominating the resources and maintains fairness across PI groups, especially those who have not utilized their fairshare allocation for an extended period.

Allocations funded with paid service units receive higher fairshare values and job priority than instructional or standard allocations.

The new high-performance Afton hardware, as well as the higher-end GPU hardware, incurs higher service unit (SU) burn rates. For example, an allocation of 40 cores and 256GB of memory on a new Afton node consumes more service units per hour than the same CPU core and memory allocation on an older Rivanna node. Similarly, use of an NVIDIA A100 80GB GPU device incurs a higher SU charge per hour than a lower-end NVIDIA A6000 GPU device.

The more SUs are consumed, the lower the fairshare value drops. This impacts the user’s priority when submitting new jobs under the same allocation.
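To see how SU consumption is affecting a group's standing, Slurm's standard accounting commands can be queried; a minimal sketch, with the allocation name and job ID as placeholders:

# Fairshare values for all users of an allocation (lower FairShare means lower queue priority)
sshare -A <allocation_name> -a

# Priority factors, including the fairshare component, for a pending job
sprio -j <jobid>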

Use of Afton hardware may allow jobs to complete faster but may consume more SUs overall due to a higher burn rate. See “How is compute time charged on the Rivanna and Afton systems?”

Please contact our user services team, or join us for our virtual office hours every Tuesday, 3-5 p.m., and Thursday, 10 a.m.-12 p.m.

Afton Release Announcements

Effective May 30, the new Afton HPC hardware is now available in a pre-release configuration as part of the HPC environment. During this pre-release phase the number of available Afton nodes may fluctuate as the RC team completes final configurations. The full production release of the Afton cluster with stable service of all 300 nodes is planned for Tuesday, July 2. Learn more.

Dear Rivanna user:

A friendly reminder that Rivanna and Research Project storage will be down for maintenance from Tuesday, May 28 at 6 a.m. through Thursday, May 30 at 6 a.m.

How to prepare and what to expect during the maintenance?
You may continue to submit jobs to Rivanna until the maintenance period begins, but if the system determines your job will not have time to finish, it will not start until Rivanna is returned to service. All Rivanna compute nodes and login nodes will be offline, including the Open OnDemand and FastX portals. Research Project storage will be unavailable. The UVA Standard Security Storage data transfer node (DTN) and Research Standard storage remain online throughout the maintenance period.

Pre-release of the new Afton cluster after the maintenance
All systems are expected to return to service by 6 a.m. on Thursday, May 30. The new Afton HPC hardware will become available in a pre-release configuration at that time, with the addition of 300 new compute nodes, 96 cores each, based on the AMD EPYC 9454 architecture. The new Afton hardware will provide additional capacity for serial, parallel and GPU computing side-by-side with the existing Rivanna system. During this pre-release phase the number of available Afton nodes may fluctuate as the RC team completes final configurations. The full production release of the Afton cluster with stable service of all 300 nodes is planned for Tuesday, July 2.

A detailed description of the maintenance plan and instructions for using the new Afton resources is available on the RC website.

If you have any questions about the Rivanna maintenance or Afton pre-release, you may contact our user services team.

At your service, RC staff

Research Computing

E hpc-support@virginia.edu
P 434.243.1107

University of Virginia
P.O. Box 400231
Charlottesville 22902
