HPC Vega Architecture

The tables below summarize the type and quantity of the major hardware components of the proposed solution for the Vega system.

Compute

GPU partition

Category | Component | Quantity | Description
Infrastructure | Rack | 2 | XH2000 DLC rack with PSUs, HYC and IB HDR switches
Compute | GPU node | 60 | 4 x NVIDIA A100, 2 x AMD Rome 7H12, 512 GB RAM, 2 x HDR dual-port mezzanine, 1 x 1.92 TB M.2 SSD

CPU partition

Category | Component | Quantity | Description
Infrastructure | Rack | 10 | XH2000 DLC rack with PSUs, HYC and IB HDR switches
Compute | CPU node, standard | 768 | 256 blades of 3 compute nodes each (2 x AMD Rome 7H12 (64c, 2.6 GHz, 280 W), 256 GB RAM, 1 x HDR100 single-port mezzanine, 1 x 1.92 TB M.2 SSD)
Compute | CPU node, large memory | 192 | 64 blades of 3 compute nodes each (2 x AMD Rome (64c, 2.6 GHz, 280 W), 1 TB RAM, 1 x HDR100 single-port mezzanine, 1 x 1.92 TB M.2 SSD)
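
The node counts in the two compute tables can be added up to get cluster-wide totals. The following is a minimal sketch, not an official figure: the dictionary keys and the function name are illustrative, "1 TB RAM" is assumed to mean 1024 GB, and the 64 cores per socket quoted for the CPU nodes are assumed to apply to the GPU nodes as well, since they use the same AMD Rome 7H12 model.

```python
# Per-node figures copied from the GPU and CPU partition tables.
# Assumptions: "1 TB RAM" is treated as 1024 GB; GPU nodes use the same
# 64-core AMD Rome 7H12 sockets as the standard CPU nodes.
PARTITIONS = {
    "gpu":              {"nodes": 60,  "sockets": 2, "cores_per_socket": 64, "gpus": 4, "ram_gb": 512},
    "cpu_standard":     {"nodes": 768, "sockets": 2, "cores_per_socket": 64, "gpus": 0, "ram_gb": 256},
    "cpu_large_memory": {"nodes": 192, "sockets": 2, "cores_per_socket": 64, "gpus": 0, "ram_gb": 1024},
}


def cluster_totals(partitions):
    """Sum nodes, physical CPU cores, GPUs and RAM over all partitions."""
    totals = {"nodes": 0, "cpu_cores": 0, "gpus": 0, "ram_gb": 0}
    for spec in partitions.values():
        nodes = spec["nodes"]
        totals["nodes"] += nodes
        totals["cpu_cores"] += nodes * spec["sockets"] * spec["cores_per_socket"]
        totals["gpus"] += nodes * spec["gpus"]
        totals["ram_gb"] += nodes * spec["ram_gb"]
    return totals


TOTALS = cluster_totals(PARTITIONS)
for key, value in TOTALS.items():
    print(f"{key}: {value}")
```

These are hardware totals only; the number of cores or GPUs actually available to jobs depends on the scheduler configuration, which is not described here.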

Storage

HPST: High-Performance Storage Tier

Category | Component | Quantity | Description
Storage | Flash-based building block | 10 | 2U ES400NVX (per device: 23 x 6.4 TB NVMe, 8 x InfiniBand HDR100, 4 embedded Lustre VMs, 1 OST and MDT per VM)

LCST: Large-Capacity Storage Tier

Category | Component | Quantity | Description
Storage | Storage node | 61 | Supermicro SuperStorage 6029P-E1CR24L with 2 x Intel Xeon Silver 4214R (12c, 2.4 GHz, 100 W), 256 GB RAM DDR4 RDIMM 2933 MT/s, 1 x 240 GB SSD, 2 x 6.4 TB NVMe, 24 x 16 TB HDD, 2 x 25GbE Mellanox ConnectX-4 DP, 1 x 1GbE IPMI
Ceph internal network | Ethernet switch | 8 | Mellanox SN2010 (per switch: 18 x 25GbE + 4 x 100GbE ports)
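
The raw capacity of each storage tier follows directly from the drive counts in the two storage tables. The sketch below multiplies them out; the variable and function names are illustrative, and the results are raw device capacities only. Usable capacity will be lower once redundancy and file system overhead are taken into account, and those settings are not given here.

```python
# Drive counts and sizes copied from the HPST and LCST tables.
HPST = {"devices": 10, "nvme_per_device": 23, "nvme_tb": 6.4}
LCST = {"nodes": 61, "hdd_per_node": 24, "hdd_tb": 16,
        "nvme_per_node": 2, "nvme_tb": 6.4}


def hpst_raw_tb(tier):
    """Raw NVMe capacity of the high-performance tier, in TB."""
    return tier["devices"] * tier["nvme_per_device"] * tier["nvme_tb"]


def lcst_raw_tb(tier):
    """Raw HDD and NVMe capacity of the large-capacity tier, in TB."""
    hdd = tier["nodes"] * tier["hdd_per_node"] * tier["hdd_tb"]
    nvme = tier["nodes"] * tier["nvme_per_node"] * tier["nvme_tb"]
    return hdd, nvme


HPST_TB = hpst_raw_tb(HPST)
LCST_HDD_TB, LCST_NVME_TB = lcst_raw_tb(LCST)
print(f"HPST raw NVMe: {HPST_TB:.1f} TB")
print(f"LCST raw HDD:  {LCST_HDD_TB:.1f} TB")
print(f"LCST raw NVMe: {LCST_NVME_TB:.1f} TB")
```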

Login and virtualization

Category | Component | Quantity | Description
CPU login | Login nodes | 4 | Atos BullSequana X430-A5 with 2 x AMD EPYC 7H12, 256 GB RAM DDR4 3200 MT/s, 2 x 7.6 TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100 Gb IB HDR ConnectX-6 SP
GPU login | Login nodes | 4 | Atos BullSequana X430-A5 with 1 x NVIDIA Ampere A100 PCIe GPU and 2 x AMD EPYC 7452 (32c, 2.35 GHz, 155 W), 256 GB RAM DDR4 3200 MT/s, 2 x 7.6 TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100 Gb IB HDR ConnectX-6 SP
Service | Virtualization/service nodes | 30 | Atos BullSequana X430-A5 with 2 x AMD EPYC 7502 (32c, 2.5 GHz, 180 W), 512 GB RAM DDR4 3200 MT/s, 2 x 7.6 TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100 Gb IB HDR ConnectX-6 SP

Network and interconnect infrastructure

Category | Component | Quantity | Description
Interconnect network | IB switch | 68 | 40-port Mellanox HDR switch, Dragonfly+ topology
Interconnect | IB HDR100/200 ports on IB cards | 1230 | 960 compute, 60 (x2) GPU, 8 login, 30 virtualization, 10 (x8) HPST and 8 (x4) Skyway gateways, all with Mellanox ConnectX-6 (single or dual port)
IPoIB gateway | IB/Ethernet data gateway | 4 | Mellanox Skyway IB-to-Ethernet gateway appliance (per gateway: 8 x IB and 8 x 100GbE ports)
Ethernet data network | Top-level switches | 2 | Cisco Nexus N3K-C3408-S, 192 x 100GbE ports activated
WAN connectivity | IP routers | 2 | Cisco Nexus N3K-C3636C-R, 5 x 100GbE to WAN (available by the end of 2021)
Top management network | 10GbE switch | 2 | Mellanox 2410 switches (per switch: 48 x 10GbE ports)
In-band/out-of-band management network | 1GbE switch | 4 | Mellanox 4610 switches (per switch: 48 x 1GbE + 2 x 10GbE ports)
Rack management network | WELB switch | 24 | Two integrated WELB (sWitch Ethernet Leaf Board) switches per rack, with three 24-port Ethernet switches and one Ethernet management controller (EMC)
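
The total of 1230 InfiniBand ports in the interconnect row can be cross-checked against its own breakdown. The sketch below is only an arithmetic check; the group labels are illustrative shorthand for the entries in that row.

```python
# (group, number of devices, IB ports per device), taken from the
# interconnect row of the network table.
IB_PORT_GROUPS = [
    ("compute nodes",        960, 1),
    ("GPU nodes",             60, 2),
    ("login nodes",            8, 1),
    ("virtualization nodes",  30, 1),
    ("storage devices",       10, 8),
    ("Skyway gateways",        8, 4),
]


def total_ib_ports(groups):
    """Sum devices x ports-per-device over all groups."""
    return sum(count * ports for _, count, ports in groups)


IB_PORTS = total_ib_ports(IB_PORT_GROUPS)
for name, count, ports in IB_PORT_GROUPS:
    print(f"{name:22s} {count:4d} x {ports} = {count * ports}")
print(f"total IB ports: {IB_PORTS}")
```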

GPU architecture

GPU specifications

NVIDIA Datacenter GPU | NVIDIA A100
GPU codename | GA100
GPU architecture | Ampere
Launch date | May 2020
GPU process | TSMC 7 nm
Die size | 826 mm²
Transistor count | 54 billion
FP64 CUDA cores | 3,456
FP32 CUDA cores | 6,912
Tensor cores | 432
Streaming Multiprocessors | 108
Peak FP64 | 9.7 teraflops
Peak FP64 Tensor Core | 19.5 teraflops
Peak FP32 | 19.5 teraflops
Peak TF32 Tensor Core | 156 teraflops/312 teraflops*
Peak BFLOAT16 Tensor Core | 312 teraflops/624 teraflops*
Peak FP16 Tensor Core | 312 teraflops/624 teraflops*
Peak INT8 Tensor Core | 624 TOPS/1,248 TOPS*
Peak INT4 Tensor Core | 1,248 TOPS/2,496 TOPS*
Mixed-precision Tensor Core | 156 teraflops/642 teraflops*
Max TDP | 400 watts

* With sparsity.
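
The per-GPU peak figures can be scaled to the whole GPU partition (60 nodes with 4 GPUs each). The sketch below does that for the two FP64 rows. It gives a theoretical upper bound under the assumption that all 240 GPUs run at their quoted peak at once, not a measured result; the function name is illustrative.

```python
# Node and GPU counts from the GPU partition table; per-GPU peaks from
# the A100 specification table.
GPU_NODES = 60
GPUS_PER_NODE = 4
PEAK_FP64_TFLOPS = 9.7
PEAK_FP64_TENSOR_TFLOPS = 19.5


def partition_peak_pflops(per_gpu_tflops, nodes=GPU_NODES, gpus_per_node=GPUS_PER_NODE):
    """Theoretical aggregate peak in petaflops (1 PFLOPS = 1000 TFLOPS)."""
    return per_gpu_tflops * nodes * gpus_per_node / 1000.0


FP64_PFLOPS = partition_peak_pflops(PEAK_FP64_TFLOPS)
FP64_TENSOR_PFLOPS = partition_peak_pflops(PEAK_FP64_TENSOR_TFLOPS)
print(f"GPU partition peak FP64:             {FP64_PFLOPS:.3f} PFLOPS")
print(f"GPU partition peak FP64 Tensor Core: {FP64_TENSOR_PFLOPS:.3f} PFLOPS")
```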

NVIDIA System Management Interface

Run the tool with the nvidia-smi command; add the --help switch to list the general options.

Multi-Instance GPU (MIG) functionality is currently not enabled on HPC Vega.

[root@gn01 ~]# nvidia-smi 
Wed Jul 12 11:50:30 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:03:00.0 Off |                    0 |
| N/A   50C    P0             140W / 400W |   2584MiB / 40960MiB |     51%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:44:00.0 Off |                    0 |
| N/A   43C    P0              56W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:84:00.0 Off |                    0 |
| N/A   44C    P0              56W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:C4:00.0 Off |                    0 |
| N/A   49C    P0              83W / 400W |   2818MiB / 40960MiB |     51%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1503150      C   ...                                        2570MiB |
|    3   N/A  N/A   1510286      C   ...                                        2802MiB |
+---------------------------------------------------------------------------------------+
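
For scripting, the boxed table above is awkward to parse. nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The sketch below shows one way to read that output from Python. The function names and the choice of fields are illustrative, and the sample text is hand-built from the values in the listing above, not captured from a real CSV run. It also assumes every field is numeric, which may not hold on GPUs that report a field as not available.

```python
import csv
import io
import shutil
import subprocess

QUERY_FIELDS = ["index", "name", "memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, name, mem_used, mem_total, util = (f.strip() for f in record)
        gpus.append({
            "index": int(index),
            "name": name,
            "memory_used_mib": int(mem_used),
            "memory_total_mib": int(mem_total),
            "utilization_pct": int(util),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and parse its CSV output (requires an NVIDIA driver)."""
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found on PATH")
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_gpu_csv(result.stdout)


# Sample text built by hand from the values shown in the listing above.
SAMPLE = """\
0, NVIDIA A100-SXM4-40GB, 2584, 40960, 51
1, NVIDIA A100-SXM4-40GB, 8, 40960, 0
2, NVIDIA A100-SXM4-40GB, 8, 40960, 0
3, NVIDIA A100-SXM4-40GB, 2818, 40960, 51
"""

GPUS = parse_gpu_csv(SAMPLE)
IDLE = [g["index"] for g in GPUS if g["utilization_pct"] == 0]
print("idle GPUs:", IDLE)
```

On a GPU node, query_gpus() would replace the sample text with live data.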

Node topology

[Figure: redstone]

Check the topology of a GPU node with the nvidia-smi command.

[root@gn01 ~]# nvidia-smi topo -mp
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU1    SYS      X      SYS     SYS     PIX     SYS     16-31,144-159   1               N/A
GPU2    SYS     SYS      X      SYS     SYS     PIX     112-127,240-255 7               N/A
GPU3    SYS     SYS     SYS      X      SYS     SYS     80-95,208-223   5               N/A
NIC0    SYS     PIX     SYS     SYS      X      SYS
NIC1    SYS     SYS     PIX     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
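
The CPU Affinity column lists the cores closest to each GPU. The sketch below expands those range strings into explicit core lists and shows how a process could be pinned to them. The affinity strings are copied from the listing above; the function names are illustrative, and pin_to_gpu relies on os.sched_setaffinity, which is available on Linux only and fails if the listed cores do not exist on the machine.

```python
import os

# GPU index -> (CPU affinity string, NUMA node), copied from the
# "nvidia-smi topo -mp" listing above.
GPU_AFFINITY = {
    0: ("48-63,176-191", 3),
    1: ("16-31,144-159", 1),
    2: ("112-127,240-255", 7),
    3: ("80-95,208-223", 5),
}


def expand_cpu_list(spec):
    """Expand a string such as '48-63,176-191' into a list of core IDs."""
    cores = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            cores.extend(range(int(first), int(last) + 1))
        else:
            cores.append(int(part))
    return cores


def pin_to_gpu(gpu_index):
    """Restrict the current process to the cores closest to one GPU.

    Linux only; not called here because the core IDs are specific to
    the node the listing was taken from.
    """
    spec, _numa = GPU_AFFINITY[gpu_index]
    os.sched_setaffinity(0, set(expand_cpu_list(spec)))


for gpu, (spec, numa) in GPU_AFFINITY.items():
    cores = expand_cpu_list(spec)
    print(f"GPU{gpu}: NUMA node {numa}, {len(cores)} logical CPUs")
```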

The NUMA ID of the CPU closest to a GPU is shown by topo -C; use the -i switch to select the GPU by its identifier (0 to 3).

[root@gn01 ~]# nvidia-smi topo -C -i 0
NUMA IDs of closest CPU: 3
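
The reported NUMA ID can be used to start an application on the cores and memory closest to the chosen GPU. The sketch below parses the output line shown above and builds a numactl command line with its --cpunodebind and --membind options. The function names and the ./my_app executable are hypothetical, and whether numactl is installed on the node is an assumption; the command is only constructed here, not executed.

```python
import re


def parse_closest_numa(output):
    """Extract the NUMA IDs from 'nvidia-smi topo -C -i <gpu>' output."""
    match = re.search(r"NUMA IDs of closest CPU:\s*([0-9, ]+)", output)
    if match is None:
        raise ValueError("no NUMA ID line found")
    return [int(x) for x in match.group(1).split(",") if x.strip()]


def build_bound_command(gpu_index, numa_ids, app_argv):
    """Return (env, argv) that run app_argv bound to the given NUMA nodes.

    CUDA_VISIBLE_DEVICES restricts the application to the selected GPU;
    numactl binds its CPU and memory placement to the closest NUMA node(s).
    """
    nodes = ",".join(str(n) for n in numa_ids)
    env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
    argv = ["numactl", f"--cpunodebind={nodes}", f"--membind={nodes}", *app_argv]
    return env, argv


# Output line copied from the listing above; ./my_app is a placeholder.
NUMA_IDS = parse_closest_numa("NUMA IDs of closest CPU: 3\n")
ENV, ARGV = build_bound_command(0, NUMA_IDS, ["./my_app"])
print(ENV, " ".join(ARGV))
```

To run the result, the argv list could be passed to subprocess.run together with an environment that merges env into os.environ.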

Show the most direct path between a selected pair of GPUs.

[root@gn01 ~]# nvidia-smi topo -p -i 0,2
Device 0 is connected to device 2 by way of an SMP interconnect link between NUMA nodes.