Cirrus hardware
System overview
Cirrus is an HPE EX4000 supercomputing system with a total of 256 compute nodes. Each compute node has 288 cores (dual AMD EPYC 9825 144-core 2.2 GHz processors), giving a total of 73,728 cores. Compute nodes are connected together by an HPE Slingshot 11 interconnect.
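As a quick check of the headline figure, the total core count follows directly from the node count and cores per node quoted above (a trivial sketch; the variable names are illustrative only):

```python
# System totals, using the figures quoted above.
compute_nodes = 256
cores_per_node = 288          # dual 144-core AMD EPYC 9825

total_cores = compute_nodes * cores_per_node
print(f"Total cores: {total_cores:,}")   # 73,728
```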
There are additional User Access Nodes (UAN, also called login nodes), which provide access to the system.
Compute nodes are only accessible via the Slurm job scheduling system.
There are two storage types: home and work. Home is available on login nodes. Work is available on login and compute nodes.
The home file system is provided by Ceph with a capacity of 1 PB.
The work file system consists of an HPE ClusterStor E1000 Lustre storage system with a capacity of 1 PB.
Compute node overview
The compute nodes each have 288 cores. They are dual socket nodes with two 144-core AMD EPYC 9825 processors. There are 192 standard memory nodes and 64 high memory nodes.
Note
Due to Simultaneous Multi-Threading (SMT), each core supports 2 threads, so a node has 288 physical cores and 576 threads. Most users will not want to use SMT.
| Component | Details |
|---|---|
| Processor | 2x AMD Zen5 (Turin) EPYC 9825, 144-core, 2.2 GHz |
| Cores per node | 288 |
| NUMA structure | 8 NUMA regions per node (36 cores per NUMA region) |
| Memory per node | 768 GB (standard), 1,536 GB (high memory) DDR5 |
| Memory per core | 2.67 GB (standard), 5.33 GB (high memory) |
| L1 cache | 80 kB/core |
| L2 cache | 1 MB/core |
| L3 cache | 32 MB/CCD |
| Vector support | AVX512 |
| Network connection | 2x 100 Gb/s injection ports per node |
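The derived figures in the table follow directly from the headline values; a small sketch of the arithmetic (variable names are illustrative only):

```python
# Derived per-node figures, using the values quoted in the table above.
cores_per_node = 288
numa_regions_per_node = 8
memory_gb = {"standard": 768, "high memory": 1536}

cores_per_numa_region = cores_per_node // numa_regions_per_node   # 36
threads_per_node = 2 * cores_per_node                             # 576 with SMT

print(f"Cores per NUMA region: {cores_per_numa_region}")
print(f"Threads per node with SMT: {threads_per_node}")
for kind, total_gb in memory_gb.items():
    print(f"Memory per core ({kind}): {total_gb / cores_per_node:.2f} GB")
# Memory per core (standard): 2.67 GB
# Memory per core (high memory): 5.33 GB
```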
Each socket contains 12 Core Complex Dies (CCDs) and one I/O die (IOD). Each CCD contains 12 cores and 32 MB of L3 cache. Thus, there are 144 cores per socket and 288 cores per node.
More information on the architecture of the AMD EPYC Zen5 processors is given in the following sections.
AMD Zen5 microarchitecture
The AMD EPYC 9825 (Turin) processor has a base CPU clock of 2.2 GHz and a maximum boost clock of 3.7 GHz. There are twelve processor dies (CCDs) with a total of 144 cores per socket.
Hybrid multi-die design:
Within each socket, the twelve processor dies are fabricated on a 3 nanometer (nm) process, while the I/O die is fabricated on a 6 nm process. This design decision was made because the processor dies need the leading edge (and more expensive) 3 nm technology in order to reduce the amount of power and space needed to double the number of cores, and to add more cache, compared to previous generation EPYC processors. The I/O die retains the less expensive, older 6 nm technology.
Infinity Fabric technology:
Infinity Fabric technology is used for communication among different components throughout the node: within cores, between cores in a core complex die (CCD), among CCDs in a socket, to the main memory and PCIe, and between the two sockets.
Processor hierarchy
The Zen5 processor hierarchy is as follows:
- Core: A CPU core has private L1I, L1D, and L2 caches, which are shared by the two SMT threads on the core.
- CCD: A core complex die includes 12 cores, 32 MB shared L3 cache and an Infinity Link to the I/O die (IOD). The CCDs connect to memory, I/O, and each other through the IOD.
- Socket: A socket includes twelve CCDs (total of 144 cores), a common centralized I/O die (with twelve unified memory controllers and eight x16 PCIe 5.0 links, 128 lanes in total), and, for the first socket, a link to the network interface controller (NIC).
- Node: A node includes two sockets and two network interface controllers (NICs).
CPU core
The AMD EPYC 9825 is a 64-bit x86 server microprocessor. A partial list of instructions and features supported by Turin includes SSE, SSE2, SSE3, SSSE3, SSE4a, SSE4.1, SSE4.2, AES, FMA, AVX, AVX2 (256 bit), AVX512, an integrated x87 FPU, Multi-Precision Add-Carry (ADX), 16-bit Floating Point Conversion (F16C), and No-eXecute (NX). For a complete list, run `cat /proc/cpuinfo` on the Cirrus login nodes.
Each core:
- Can sustain execution of four x86 instructions per cycle, using features such as the micro-op cache, advanced branch prediction, and prefetching. The prefetcher works on streaming data and on variable strides, allowing it to accelerate many different data structures.
- Has two 256-bit Fused Multiply-Add (FMA) units and can deliver up to 16 double-precision floating point operations (flops) per cycle. Thus, the peak double-precision performance per node (at base frequency) is 288 cores x 2.2 GHz x 16 flops/cycle ≈ 10.1 teraflops (see the worked calculation after this list).
- Can support Simultaneous Multi-threading (SMT), allowing two threads to execute simultaneously per core. SMT is available on Cirrus compute nodes but example submission scripts all use physical cores only as SMT is not usually beneficial for HPC applications.
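A worked version of the peak-rate figure, assuming the two 256-bit FMA units and the 2.2 GHz base clock quoted above (each 256-bit FMA operates on 4 double-precision values and counts as 2 floating point operations):

$$
\text{flops per cycle per core} = 2\ \text{FMA units} \times 4\ \text{doubles} \times 2\ \text{ops} = 16
$$

$$
\text{node peak} = 288\ \text{cores} \times 2.2\ \text{GHz} \times 16\ \text{flops/cycle} \approx 10.1\ \text{Tflop/s}
$$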
NPS setting
The Cirrus compute nodes use the NPS4 setting (4 NUMA domains per socket), giving the 8 NUMA regions per node listed in the table above.
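A minimal sketch (assuming a standard Linux sysfs layout; run it on a compute node rather than a login node) to confirm the NUMA layout produced by NPS4:

```python
# List the NUMA regions and the CPUs in each, via Linux sysfs.
# With NPS4 on a dual-socket node, expect 8 regions (node0 .. node7).
from pathlib import Path

numa_nodes = sorted(Path("/sys/devices/system/node").glob("node[0-9]*"))
print(f"NUMA regions: {len(numa_nodes)}")
for node in numa_nodes:
    cpulist = (node / "cpulist").read_text().strip()
    print(f"{node.name}: CPUs {cpulist}")
```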
Interconnect details
Cirrus has an HPE Slingshot 11 interconnect with 200 Gb/s signalling per node. It uses a dragonfly topology:
- Nodes are organized into groups.
    - 128 nodes in a group.
    - 2x NIC per node.
    - 16 switches per group.
    - Electrical links between Network Interface Card (NIC) and switch.
    - All-to-all connection amongst switches in a group using electrical links.
- All-to-all connection between groups using optical links.
    - 2 groups per Cirrus Cabinet.
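A sketch of the per-group link arithmetic implied by the figures above (these are derived, illustrative counts, not official specifications):

```python
# Per-group link counts for the dragonfly topology, derived from the
# figures quoted above: 128 nodes/group, 2 NICs/node, 16 switches/group.
nodes_per_group = 128
nics_per_node = 2
switches_per_group = 16

nic_to_switch_links = nodes_per_group * nics_per_node                         # 256
switch_to_switch_links = switches_per_group * (switches_per_group - 1) // 2   # 120

print(f"Electrical NIC-to-switch links per group: {nic_to_switch_links}")
print(f"All-to-all switch links within a group: {switch_to_switch_links}")
```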