HPC Vega Architecture
The tables below summarize the type and quantity of the major hardware components of the proposed solution for the Vega system:
Compute
GPU partition
Category | Component | Quantity | Description |
Infrastructure | Rack | 2 | XH2000 DLC rack with PSU, HYC and IB HDR switches |
Compute | GPU node | 60 | 4 x NVIDIA A100, 2 x AMD Rome 7H12, 512GB RAM, 2 x HDR dual port mezzanine, 1 x 1.92TB M.2 SSD |
CPU partition
Category | Component | Quantity | Description |
Infrastructure | Rack | 10 | XH2000 DLC rack with PSU, HYC and IB HDR switches |
Compute | CPU node, standard | 768 | 256 blades of 3 compute nodes (2 x AMD Rome 7H12 (64c, 2.6GHz, 280W), 256GB RAM, 1 x HDR100 single port mezzanine, 1 x 1.92TB M.2 SSD) |
Compute | CPU node, large memory | 192 | 64 blades of 3 compute nodes (2 x AMD Rome 7H12 (64c, 2.6GHz, 280W), 1TB RAM, 1 x HDR100 single port mezzanine, 1 x 1.92TB M.2 SSD) |
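The node counts in the CPU partition table can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming only the figures quoted in the table (3 nodes per blade, 2 sockets per node, 64 cores per socket); the function and constant names are illustrative and not part of any Vega tooling.

```python
# Cross-check of the CPU partition figures quoted in the table.
NODES_PER_BLADE = 3     # each blade holds 3 compute nodes
SOCKETS_PER_NODE = 2    # 2 x AMD Rome 7H12 per node
CORES_PER_SOCKET = 64   # "64c" in the description column


def blades_needed(nodes, nodes_per_blade=NODES_PER_BLADE):
    """Return the number of blades required to host `nodes` nodes."""
    return -(-nodes // nodes_per_blade)  # ceiling division


def total_cores(nodes):
    """Return the number of physical CPU cores across `nodes` nodes."""
    return nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET


standard_nodes = 768
large_memory_nodes = 192

print(blades_needed(standard_nodes))       # 256
print(blades_needed(large_memory_nodes))   # 64
print(total_cores(standard_nodes + large_memory_nodes))  # 122880
```

The 960 CPU nodes therefore occupy 320 blades and provide 122,880 physical cores in total.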
Storage
Category | Component | Quantity | Description |
Storage | Flash-based building block | 10 | 2U ES400NVX (per device: 23 x 6.4TB NVMe, 8 x InfiniBand HDR100, 4 embedded Lustre VMs, 1 OST and MDT per VM) |
LCST – Large Capacity Storage Tier
Category | Component | Quantity | Description |
Storage | Storage node | 61 | Supermicro SuperStorage 6029P-E1CR24L with 2 x Intel Xeon Silver 4214R (12c, 2.4GHz, 100W), 256GB RAM DDR4 RDIMM 2933MT/s, 1 x 240GB SSD, 2 x 6.4TB NVMe, 24 x 16TB HDD, 2 x 25GbE Mellanox ConnectX-4 DP, 1 x 1GbE IPMI |
Ceph internal network | Ethernet switch | 8 | Mellanox SN2010 (per switch: 18 x 25GbE + 4 x 100GbE ports) |
Login and virtualization
Category | Component | Quantity | Description |
CPU login | Login nodes | 4 | Atos BullSequana X430-A5 with 2 x AMD EPYC 7H12, 256GB RAM DDR4 3200MT/s, 2 x 7.6TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP |
GPU login | Login nodes | 4 | Atos BullSequana X430-A5 with 1 x NVIDIA Ampere A100 PCIe GPU and 2 x AMD EPYC 7452 (32c, 2.35GHz, 155W), 256GB RAM DDR4 3200MT/s, 2 x 7.6TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP |
Service | Virtualization/service nodes | 30 | Atos BullSequana X430-A5 with 2 x AMD EPYC 7502 (32c, 2.5GHz, 180W), 512GB RAM DDR4 3200MT/s, 2 x 7.6TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP |
Network and interconnect infrastructure
Category | Component | Quantity | Description |
Interconnect network | IB switch | 68 | 40-port Mellanox HDR switch, Dragonfly+ topology |
Interconnect | IB HDR100/200 ports on IB cards | 1230 | 960 compute, 60 (x2) GPU, 8 login, 30 virtualization, 10 (x8) HCST and 8 (x4) Skyway Gateways with Mellanox ConnectX-6 (single or dual port) |
IPoIB gateway | IB/Ethernet data gateway | 4 | Mellanox Skyway IB to Ethernet Gateway Appliance (per gateway: 8 x IB and 8 x 100GbE ports) |
Ethernet data network | Top-level switches | 2 | Cisco Nexus N3K-C3408-S, 192 x 100GbE ports activated |
WAN connectivity | IP routers | 2 | Cisco Nexus N3K-C3636C-R, 5 x 100GbE to WAN (available by the end of 2021) |
Main management network | 10GbE switch | 2 | Mellanox 2410 switches (per switch: 48 x 10GbE ports) |
In-band/out-of-band management network | 1GbE switch | 4 | Mellanox 4610 switches (per switch: 48 x 1GbE + 2 x 10GbE ports) |
Rack management network | WELB switch | 24 | Two integrated WELB (sWitch Ethernet Leaf Board) switches per rack, each with three 24-port Ethernet switches and one Ethernet Management Controller (EMC) |
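The InfiniBand port total and the WELB switch count follow from the per-component figures given in the tables. The sketch below simply re-adds them; the dictionary keys are illustrative labels, and the per-device port multipliers are the ones quoted in the interconnect description.

```python
# InfiniBand ports: (number of devices) x (IB ports per device),
# taken from the interconnect row of the network table.
ib_ports = {
    "CPU compute nodes": 960 * 1,
    "GPU nodes": 60 * 2,
    "login nodes": 8 * 1,
    "virtualization/service nodes": 30 * 1,
    "HCST storage appliances": 10 * 8,
    "Skyway gateways": 8 * 4,
}
total = sum(ib_ports.values())
print(total)  # 1230

# WELB switches: two per rack, over the GPU and CPU partition racks.
gpu_racks = 2
cpu_racks = 10
welb_switches = (gpu_racks + cpu_racks) * 2
print(welb_switches)  # 24
```

Both results match the quantities listed in the table (1230 ports and 24 WELB switches).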
GPU architecture
GPU specifications
NVIDIA Datacenter GPU | NVIDIA A100 |
GPU codename | GA100 |
GPU architecture | Ampere |
Launch date | May 2020 |
GPU process | TSMC 7nm |
Die size | 826 mm² |
Transistor count | 54 billion |
FP64 CUDA cores | 3,456 |
FP32 CUDA cores | 6,912 |
Tensor cores | 432 |
Streaming Multiprocessors | 108 |
Peak FP64 | 9.7 teraflops |
Peak FP64 Tensor Core | 19.5 teraflops |
Peak FP32 | 19.5 teraflops |
Peak FP32 Tensor Core | 156 teraflops/312 teraflops* |
Peak BFLOAT16 Tensor Core | 312 teraflops/624 teraflops* |
Peak FP16 Tensor Core | 312 teraflops/624 teraflops* |
Peak INT8 Tensor Core | 624 TOPS/1,248 TOPS* |
Peak INT4 Tensor Core | 1,248 TOPS/2,496 TOPS* |
Mixed-precision Tensor Core | 312 teraflops/624 teraflops* |
Max TDP | 400 watts |
* Effective throughput with structured sparsity.
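The peak FP64 and FP32 figures can be reproduced from the core counts in the specification table. This is a rough sketch, assuming a boost clock of 1.41 GHz (a value not stated in this document) and two floating-point operations per core per cycle (one fused multiply-add); the function name is illustrative.

```python
# Core counts taken from the specification table.
SMS = 108
FP64_CORES = 3456
FP32_CORES = 6912
TENSOR_CORES = 432

BOOST_CLOCK_GHZ = 1.41   # assumption: not given in this document
FLOPS_PER_CYCLE = 2      # one fused multiply-add = 2 floating-point ops


def peak_tflops(cores, clock_ghz=BOOST_CLOCK_GHZ):
    """Theoretical peak in teraflops for `cores` FMA units."""
    return cores * FLOPS_PER_CYCLE * clock_ghz / 1000.0


# Per-SM resources derived by dividing the totals by the SM count.
print(FP64_CORES // SMS, FP32_CORES // SMS, TENSOR_CORES // SMS)  # 32 64 4

print(round(peak_tflops(FP64_CORES), 1))  # 9.7
print(round(peak_tflops(FP32_CORES), 1))  # 19.5
```

Under these assumptions the results agree with the Peak FP64 and Peak FP32 rows; the Tensor Core figures depend on per-instruction throughput and are not derived here.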
NVIDIA System Management Interface
Run the tool with the nvidia-smi command; add the --help switch to list the general options.
The Multi-Instance GPU (MIG) functionality is currently not enabled on HPC Vega.
[root@gn01 ~]# nvidia-smi
Wed Jul 12 11:50:30 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:03:00.0 Off | 0 |
| N/A 50C P0 140W / 400W | 2584MiB / 40960MiB | 51% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:44:00.0 Off | 0 |
| N/A 43C P0 56W / 400W | 8MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:84:00.0 Off | 0 |
| N/A 44C P0 56W / 400W | 8MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:C4:00.0 Off | 0 |
| N/A 49C P0 83W / 400W | 2818MiB / 40960MiB | 51% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1503150 C ... 2570MiB |
| 3 N/A N/A 1510286 C ... 2802MiB |
+---------------------------------------------------------------------------------------+
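For scripting, the same information is easier to consume through the nvidia-smi --query-gpu option with CSV output than by reading the table. The following is a minimal sketch: the helper names parse_gpu_csv and query_gpus are hypothetical, and SAMPLE is hand-built from the memory and utilization values in the transcript above rather than captured from a real run.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi --query-gpu.
FIELDS = ["index", "name", "memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        index, name, used, total, util = (field.strip() for field in record)
        gpus.append({
            "index": int(index),
            "name": name,
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "utilization_pct": int(util),
        })
    return gpus


def query_gpus():
    """Query the GPUs of the current node (needs nvidia-smi on PATH)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Illustrative sample mirroring the transcript values (not a real capture).
SAMPLE = """\
0, NVIDIA A100-SXM4-40GB, 2584, 40960, 51
1, NVIDIA A100-SXM4-40GB, 8, 40960, 0
2, NVIDIA A100-SXM4-40GB, 8, 40960, 0
3, NVIDIA A100-SXM4-40GB, 2818, 40960, 51
"""

idle = [g["index"] for g in parse_gpu_csv(SAMPLE) if g["utilization_pct"] == 0]
print(idle)  # [1, 2]
```

On a GPU node, calling query_gpus() instead of parsing SAMPLE would return the live values.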
Node topology
Check the topology of a GPU node with the nvidia-smi topo command.
[root@gn01 ~]# nvidia-smi topo -mp
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS SYS 48-63,176-191 3 N/A
GPU1 SYS X SYS SYS PIX SYS 16-31,144-159 1 N/A
GPU2 SYS SYS X SYS SYS PIX 112-127,240-255 7 N/A
GPU3 SYS SYS SYS X SYS SYS 80-95,208-223 5 N/A
NIC0 SYS PIX SYS SYS X SYS
NIC1 SYS SYS PIX SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
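The topology matrix is typically used to place a process on the CPU cores and the network adapter closest to its GPU. The sketch below hard-codes the CPU affinity and GPU-to-NIC columns from the output above; the ordering of link types from closest to farthest follows the legend, and all function names are illustrative.

```python
# Values copied from the `nvidia-smi topo -mp` output above.
GPU_CPU_AFFINITY = {
    0: "48-63,176-191",
    1: "16-31,144-159",
    2: "112-127,240-255",
    3: "80-95,208-223",
}
GPU_NIC_LINKS = {
    0: {"NIC0": "SYS", "NIC1": "SYS"},
    1: {"NIC0": "PIX", "NIC1": "SYS"},
    2: {"NIC0": "SYS", "NIC1": "PIX"},
    3: {"NIC0": "SYS", "NIC1": "SYS"},
}
# Link types ordered from closest to farthest, per the legend.
LINK_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}


def expand_cpu_list(spec):
    """Expand a range list such as '48-63,176-191' into a list of CPU ids."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def nearest_nic(gpu):
    """Return the NIC with the shortest link to `gpu` (first one on a tie)."""
    links = GPU_NIC_LINKS[gpu]
    return min(links, key=lambda nic: LINK_RANK[links[nic]])


cpus = expand_cpu_list(GPU_CPU_AFFINITY[0])
print(len(cpus), cpus[0], cpus[-1])  # 32 48 191
print(nearest_nic(1), nearest_nic(2))  # NIC0 NIC1
```

On Linux, the resulting list can be passed to os.sched_setaffinity(0, cpus) to pin the current process. GPU0 and GPU3 reach both adapters over SYS links, so the choice of NIC for them is arbitrary in this sketch.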
The NUMA ID of the CPU closest to a GPU is shown with the -C switch; select the GPU with -i followed by its identifier (0-3).
[root@gn01 ~]# nvidia-smi topo -C -i 0
NUMA IDs of closest CPU: 3
Show the most direct path between a selected pair of GPUs with the -p switch.
[root@gn01 ~]# nvidia-smi topo -p -i 0,2
Device 0 is connected to device 2 by way of an SMP interconnect link between NUMA nodes.