
HPC Vega architecture

The tables below summarize the type and quantity of the major hardware components of the proposed solution for the Vega system:

Compute

GPU partition

Category | Component | Quantity | Description
Infrastructure | Rack | 2 | XH2000 DLC rack with components: PSUs, HYC, and IB HDR switches
Compute | GPU node | 60 | 4 x NVIDIA A100, 2 x AMD Rome 7H12, 512 GB RAM, 2 x HDR dual port mezzanine, 1 x 1.92TB M.2 SSD

CPU partition

Category | Component | Quantity | Description
Infrastructure | Rack | 10 | XH2000 DLC rack with components: PSUs, HYC, and IB HDR switches
Compute | CPU node, standard | 768 | 256 blades of 3 compute nodes (2 x AMD Rome 7H12 (64c, 2.6GHz, 280W), 256GB RAM, 1 x HDR100 single port mezzanine, 1 x 1.92TB M.2 SSD)
Compute | CPU node, large memory | 192 | 64 blades of 3 compute nodes (2 x AMD Rome 7H12 (64c, 2.6GHz, 280W), 1TB RAM, 1 x HDR100 single port mezzanine, 1 x 1.92TB M.2 SSD)
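
As a quick sanity check on the CPU partition table, the total node and core counts follow from the two node quantities and the 64 cores of each AMD Rome socket. The sketch below is illustrative arithmetic only; the variable names are made up for this example and are not part of any Vega tooling.

```python
# Quantities taken from the CPU partition table.
standard_nodes = 768
large_memory_nodes = 192
sockets_per_node = 2      # 2 x AMD Rome per node
cores_per_socket = 64     # "64c" in the table

total_nodes = standard_nodes + large_memory_nodes
total_cores = total_nodes * sockets_per_node * cores_per_socket

print(f"CPU nodes: {total_nodes}")
print(f"CPU cores: {total_cores}")
```

The node total of 960 matches the number of compute endpoints listed for the InfiniBand interconnect.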

Storage

HPST (High-Performance Storage Tier)

Category | Component | Quantity | Description
Storage | Flash-based building block | 10 | 2U ES400NVX (per appliance: 23 x 6.4TB NVMe, 8 x InfiniBand HDR100, 4 embedded Lustre VMs, 1 OST and 1 MDT per VM)

LCST (Large-Capacity Storage Tier)

Category | Component | Quantity | Description
Storage | Storage node | 61 | Supermicro SuperStorage 6029P-E1CR24L with 2 x Intel Xeon Silver 4214R (12c, 2.4GHz, 100W), 256GB RAM DDR4 RDIMM 2933MT/s, 1 x 240GB SSD, 2 x 6.4TB NVMe, 24 x 16TB HDD, 2 x 25GbE Mellanox ConnectX-4 DP, 1 x 1GbE IPMI
Internal Ceph network | Ethernet switch | 8 | Mellanox SN2010. Per switch: 18 x 25GbE + 4 x 100GbE ports
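
The raw capacity of both storage tiers can be estimated from the drive counts in the two storage tables. The sketch below is a rough, assumption-laden calculation: it counts only the NVMe drives of the flash appliances and the HDDs of the storage nodes, and it ignores redundancy, file system overhead, and the difference between TB and TiB, so usable capacity will be lower.

```python
# HPST: 10 appliances, each with 23 x 6.4 TB NVMe drives.
hpst_appliances = 10
hpst_drives_per_appliance = 23
hpst_drive_tb = 6.4

# LCST: 61 storage nodes, each with 24 x 16 TB HDDs.
lcst_nodes = 61
lcst_hdds_per_node = 24
lcst_hdd_tb = 16

hpst_raw_tb = hpst_appliances * hpst_drives_per_appliance * hpst_drive_tb
lcst_raw_tb = lcst_nodes * lcst_hdds_per_node * lcst_hdd_tb

print(f"HPST raw NVMe capacity: {hpst_raw_tb:.0f} TB")
print(f"LCST raw HDD capacity:  {lcst_raw_tb:.0f} TB")
```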

Login and virtualization

Category | Component | Quantity | Description
CPU login | Login nodes | 4 | Atos BullSequana X430-A5 with 2 x AMD EPYC 7H12, 256GB RAM DDR4 3200MT/s, 2 x 7.6TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP
GPU login | Login nodes | 4 | Atos BullSequana X430-A5 with 1 x NVIDIA Ampere A100 PCIe GPU and 2 x AMD EPYC 7452 (32c, 2.35GHz, 155W), 256GB RAM DDR4 3200MT/s, 2 x 7.6TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP
Service | Virtualization/service nodes | 30 | Atos BullSequana X430-A5 with 2 x AMD EPYC 7502 (32c, 2.5GHz, 180W), 512GB RAM DDR4 3200MT/s, 2 x 7.6TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP

Network and interconnect infrastructure

Category | Component | Quantity | Description
Interconnect network | IB switch | 68 | 40-port Mellanox HDR switch, Dragonfly+ topology
Interconnect | IB HDR100/200 ports on IB cards | 1230 | 960 compute, 60 (x2) GPU, 8 login, 30 virtualization, 10 (x8) HPST, and 8 (x4) Skyway gateways, using Mellanox ConnectX-6 (single or dual port)
IPoIB gateway | IB/Ethernet data gateway | 4 | Mellanox Skyway IB-to-Ethernet gateway appliance (per gateway: 8 x IB and 8 x 100GbE ports)
Ethernet data network | Top-level switches | 2 | Cisco Nexus N3K-C3408-S, 192 x 100GbE ports activated
WAN connectivity | IP routers | 2 | Cisco Nexus N3K-C3636C-R, 5 x 100GbE to WAN (available by the end of 2021)
Top management network | 10GbE switch | 2 | Mellanox 2410 switches (per switch: 48 x 10GbE ports)
In-band/out-of-band management network | 1GbE switch | 4 | Mellanox 4610 switches (per switch: 48 x 1GbE + 2 x 10GbE ports)
Rack management network | WELB switch | 24 | Two integrated WELB (sWitch Ethernet Leaf Board) switches per rack, with three 24-port Ethernet switches and one Ethernet management controller (EMC)
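
The total of 1230 InfiniBand ports in the interconnect row can be reproduced from its per-group breakdown. The sketch below simply multiplies each group's device count by its ports per device, reading "60 (x2)" as 60 devices with 2 ports each; the group labels are descriptive names chosen for this example.

```python
# (endpoint group, number of devices, IB ports per device), as listed in the table.
ib_endpoints = [
    ("CPU compute nodes", 960, 1),
    ("GPU nodes", 60, 2),
    ("login nodes", 8, 1),
    ("virtualization/service nodes", 30, 1),
    ("flash storage appliances", 10, 8),
    ("Skyway gateways", 8, 4),
]

total_ports = sum(count * ports for _, count, ports in ib_endpoints)

for name, count, ports in ib_endpoints:
    print(f"{name:32s} {count:4d} x {ports} = {count * ports}")
print(f"Total IB ports: {total_ports}")
```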

GPU architecture

GPU specifications

NVIDIA Datacenter GPU | NVIDIA A100
GPU codename | GA100
GPU architecture | Ampere
Launch date | May 2020
GPU process | TSMC 7nm
Die size | 826 mm²
Transistor count | 54 billion
FP64 CUDA cores | 3,456
FP32 CUDA cores | 6,912
Tensor cores | 432
Streaming Multiprocessors | 108
Peak FP64 | 9.7 teraflops
Peak FP64 Tensor Core | 19.5 teraflops
Peak FP32 | 19.5 teraflops
Peak TF32 Tensor Core | 156 teraflops/312 teraflops*
Peak BFLOAT16 Tensor Core | 312 teraflops/624 teraflops*
Peak FP16 Tensor Core | 312 teraflops/624 teraflops*
Peak INT8 Tensor Core | 624 TOPS/1,248 TOPS*
Peak INT4 Tensor Core | 1,248 TOPS/2,496 TOPS*
Mixed-precision Tensor Core | 156 teraflops/642 teraflops*
Max TDP | 400 watts

* With structured sparsity.
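
Combining the per-GPU peak figures with the GPU partition size (60 nodes with 4 A100 GPUs each) gives a theoretical upper bound for the whole partition. The sketch below is a back-of-the-envelope estimate, not a measured result; sustained application performance will be lower.

```python
# Partition size from the GPU partition table.
gpu_nodes = 60
gpus_per_node = 4

# Per-GPU peak values from the specification table, in teraflops.
peak_fp64_tflops = 9.7
peak_fp64_tensor_tflops = 19.5

total_gpus = gpu_nodes * gpus_per_node
partition_fp64_pflops = total_gpus * peak_fp64_tflops / 1000
partition_fp64_tensor_pflops = total_gpus * peak_fp64_tensor_tflops / 1000

print(f"GPUs in partition:     {total_gpus}")
print(f"Peak FP64:             {partition_fp64_pflops:.2f} PFLOPS")
print(f"Peak FP64 Tensor Core: {partition_fp64_tensor_pflops:.2f} PFLOPS")
```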

NVIDIA System Management Interface

Run the tool with the nvidia-smi command; add the --help switch to list the general options.

Multi-Instance GPU (MIG) functionality is currently not enabled on HPC Vega.

[root@gn01 ~]# nvidia-smi
Wed Jul 12 11:50:30 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:03:00.0 Off |                    0 |
| N/A   50C    P0             140W / 400W |   2584MiB / 40960MiB |     51%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:44:00.0 Off |                    0 |
| N/A   43C    P0              56W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:84:00.0 Off |                    0 |
| N/A   44C    P0              56W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:C4:00.0 Off |                    0 |
| N/A   49C    P0              83W / 400W |   2818MiB / 40960MiB |     51%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1503150      C   ...                                        2570MiB |
|    3   N/A  N/A   1510286      C   ...                                        2802MiB |
+---------------------------------------------------------------------------------------+
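
For scripted monitoring, the boxed table is awkward to parse; nvidia-smi can also emit CSV through its --query-gpu and --format options. The sketch below is one possible approach, not an official Vega tool: the helper names parse_gpu_csv and query_gpus are hypothetical, and the sample text is built from the GPU 0 and GPU 3 values shown in the output above.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the order defines the CSV column order.
QUERY_FIELDS = ["index", "name", "memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (field.strip() for field in record))))
    return rows


def query_gpus():
    """Run nvidia-smi on the current node and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Sample CSV text corresponding to GPU 0 and GPU 3 in the output above.
sample = (
    "0, NVIDIA A100-SXM4-40GB, 2584, 40960, 51\n"
    "3, NVIDIA A100-SXM4-40GB, 2818, 40960, 51\n"
)
for gpu in parse_gpu_csv(sample):
    print(gpu["index"], gpu["name"], gpu["memory.used"], "MiB used,",
          gpu["utilization.gpu"], "% util")
```

On a GPU node, calling query_gpus() would run the command itself; the parser is kept separate so it can be checked without GPU hardware.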

Node topology

redstone

Check the topology of a GPU node with the nvidia-smi command.

[root@gn01 ~]# nvidia-smi topo -mp
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity      NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     SYS     SYS     48-63,176-191     3               N/A
GPU1    SYS      X      SYS     SYS     PIX     SYS     16-31,144-159     1               N/A
GPU2    SYS     SYS      X      SYS     SYS     PIX     112-127,240-255   7               N/A
GPU3    SYS     SYS     SYS      X      SYS     SYS     80-95,208-223     5               N/A
NIC0    SYS     PIX     SYS     SYS      X      SYS
NIC1    SYS     SYS     PIX     SYS     SYS      X

Legend:

 X  = Self
 SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
 NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
 PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
 PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
 PIX = Connection traversing at most a single PCIe bridge

NIC Legend:

 NIC0: mlx5_0
 NIC1: mlx5_1
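
The CPU Affinity column tells which cores sit closest to each GPU, which is useful when pinning a process next to the GPU it drives. The sketch below expands those range strings into explicit core lists; the mapping is copied from the topology matrix above, while the function names are hypothetical helpers written for this example.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as '48-63,176-191' into a sorted list of core IDs."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cores.update(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            cores.add(int(part))
    return sorted(cores)


# CPU affinity per GPU index, copied from the `nvidia-smi topo -mp` matrix.
GPU_CPU_AFFINITY = {
    0: "48-63,176-191",
    1: "16-31,144-159",
    2: "112-127,240-255",
    3: "80-95,208-223",
}


def cores_for_gpu(gpu_index):
    """Return the list of cores closest to the given GPU."""
    return parse_cpu_list(GPU_CPU_AFFINITY[gpu_index])


cores = cores_for_gpu(0)
print(len(cores), cores[0], cores[-1])  # 32 48 191
```

On Linux, such a list could be passed to os.sched_setaffinity(0, cores) to pin the current process, provided those cores are part of the job's allocation.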

The NUMA ID of the closest CPU is reported by the -C switch, with -i selecting the GPU by its identifier (0-3).

[root@gn01 ~]# nvidia-smi topo -C -i 0
NUMA IDs of closest CPU: 3

The -p switch shows the most direct path between a selected pair of GPUs.

[root@gn01 ~]# nvidia-smi topo -p -i 0,2
Device 0 is connected to device 2 by way of an SMP interconnect link between NUMA nodes.