2026-02-23
Discoveries
- Forge Awesome List Aggregates
- https://www.awesomelists.io/
- https://awesome.ecosyste.ms/
- Note: Searching GitHub for "awesome" returns 1000 results that all have >2K stars.
- hotgithub.com - Curated GitHub projects of interest.
AI Hardware Research
NVIDIA
NVIDIA Details
CUDA
DGX Platform - Unified AI development solution for enterprise.
- DGX SuperPOD - The full NVIDIA stack (compute, networking, software) in one package.
- DGX Vera Rubin NVL72 Systems - Liquid-cooled racked systems.
- DGX Rubin NVL8 - Turnkey platform.
- GPUs: 8x Rubin GPUs
- Memory / Bandwidth: 2.3TB / 160 TB/s
- Perf: NVFP4 Inf: 400PF, NVFP4 Trn: 280PF, FP8/FP6 Trn: 140 PF
- CPU: 2x Intel Xeon 6776P
- NVLink Switch / Bandwidth: 4x / 28.8 TB/s
- Power: ~24kW
- Networking:
- 8x OSFP (NVIDIA ConnectX-9 VPI) ports
- Max 800 Gb/s InfiniBand and Ethernet
- SW: NVIDIA DGX OS, Ubuntu, Red Hat Enterprise Linux, Rocky
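A quick arithmetic cross-check of the NVL8 figures against the per-GPU Rubin number quoted later in these notes (50 petaFLOPS of NVFP4); that per-GPU rate is the only assumption carried over:

```python
# NVL8 sanity check: 8 Rubin GPUs x 50 PF NVFP4 each
# should reproduce the quoted 400 PF aggregate inference figure.
rubin_nvfp4_pf = 50   # petaFLOPS NVFP4 per Rubin GPU (from the Rubin GPU notes below)
gpu_count = 8
print(gpu_count * rubin_nvfp4_pf)  # 400
```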
- DGX GB300 Systems - Grace Blackwell Ultra Superchips.
- GPUs: 72x Blackwell Ultra GPUs, 36x Grace CPUs
- GPU Mem / Bandwidth: 20 TB / 576 TB/s
- CPU Cores: 2592x Arm Neoverse V2 Cores
- Fast Mem: 37 TB
- Perf: FP4: 1440 PF/1080 PF*, FP8/FP6: 720 PF
- *?
- Net: 72x (ConnectX-8 VPI) OSFP ports (800 Gb/s InfiniBand)
- NVLink: 9x L1 NVIDIA Switches
- Mgmt: BMC with RJ45
- SW: Mission Control, AI Enterprise, DGX OS / Ubuntu
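The GB300 totals are internally consistent if each Grace CPU contributes 72 Neoverse V2 cores and each Blackwell Ultra GPU roughly 8 TB/s of HBM3e bandwidth; both per-unit figures are assumptions, not stated above:

```python
# GB300 cross-check: per-unit figures x unit counts vs the quoted totals.
grace_cpus, cores_per_grace = 36, 72   # assumed 72 Neoverse V2 cores per Grace CPU
gpus, hbm_tbs_per_gpu = 72, 8          # assumed ~8 TB/s HBM3e per Blackwell Ultra GPU
print(grace_cpus * cores_per_grace)    # 2592, matching "2592x Arm Neoverse V2 Cores"
print(gpus * hbm_tbs_per_gpu)          # 576, matching "576 TB/s"
```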
- DGX B300 Systems - Blackwell Ultra
- GPU: 8x Blackwell Ultra SXM
- GPU Mem: 2.1 TB
- CPU: Intel Xeon 6776P Processors
- Perf: FP4 - 144 PF | 108 PF, FP8: 72 PF
- NVLink: 2x
- 14.4 TB/s
- Net: 8x (ConnectX-8 VPI) OSFP ports (800 Gb/s InfiniBand/Ethernet), 2x QSFP112 BlueField-3 DPU
- Mgmt: 1GbE NIC with RJ45 and BMC
- Storage: OS - 1.9TB NVMe M.2, Internal - 8x 3.84 TB NVMe E1.S
- Power: ~14kW
- Software:
- AI Enterprise
- Mission Control
- Run:ai technology
- DGX OS, RHEL, Rocky, Ubuntu
- Rack Units: 10U
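The B300's 14.4 TB/s NVLink figure falls out of the per-GPU NVLink rate times eight GPUs; the 1.8 TB/s per-GPU figure (fifth-generation NVLink) is an assumption here:

```python
# B300 NVLink cross-check: 8 GPUs x 1.8 TB/s per-GPU NVLink bandwidth.
gpus = 8
nvlink_tbs_per_gpu = 1.8   # assumed NVLink 5 per-GPU bandwidth
print(gpus * nvlink_tbs_per_gpu)  # 14.4
```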
- DGX GB200 Systems - Large-scale Grace Blackwell
- GPU: 72x Blackwell GPUs, 36x Grace CPUs
- 13.4 TB HBM3e | 576 TB/s
- CPU Cores: 2592 Arm Neoverse V2 Cores
- Fast Mem: 30.2 TB
- Perf: FP4: 1440 PF | 720 PF, FP8/FP6: 720 PF | 360 PF
- Interconnect: 72x (ConnectX-7 VPI) OSFP ports (400 Gb/s InfiniBand), 36x dual-port BlueField-3 VPI with 200 Gb/s InfiniBand/Ethernet
- NVLink Switch: 9x L1 NVLink Switches
- Mgmt: BMC with RJ45
- SW: Mission Control, AI Enterprise, DGX OS / Ubuntu
- DGX B200 Systems - Blackwell
- GPU: 8x Blackwell GPUs
- Mem: 1440 GB, 64 TB/s HBM3e bandwidth
- Perf: FP4: 144PF | 72 PF, FP8: 72 PF
- NVSwitch: 2x
- NVLink Bandwidth: 14.4 TB/s
- Power: 14.3 kW
- CPU: 2 Intel Xeon 8570, 112 Cores (2.1GHz/4GHz)
- SysMem: 2-4TB
- Net: 4x (ConnectX-7 VPI) OSFP ports (400 Gb/s), 2x dual-port QSFP112 BlueField-3 DPU (400 Gb/s)
- Mgmt: 10 Gb/s onboard NIC, 100 Gb/s dual-port NIC, BMC with RJ45
- Storage: 2x 1.9 TB NVMe M.2, 8x 3.84 TB NVMe U.2
- Software:
- AI Enterprise - Optimized AI software
- Mission Control - Ops and Orch with Run:ai tech
- DGX OS / Ubuntu - OS
- Rack Units: 10U
- DGX H200 Systems - Hopper H200
- GPU: 8x H200 (Tensor Core) GPUs
- Mem: 1128 GB
- Perf: FP8 - 32 PF
- NVSwitch: 4x
- Power: 10.2 kW
- CPU: 2x Intel Xeon Platinum 8480C, 112 cores, 2.0GHz/3.8GHz
- SysMem: 2TB
- Net: 4x (ConnectX-7 VPI) OSFP (400 Gb/s InfiniBand/Ethernet)
- Mgmt: BMC via RJ45
- Storage: 2x 1.9 TB NVMe M.2, Int: 8x 3.84 TB NVMe U.2
- Software:
- AI Enterprise
- Base Command
- DGX OS / Ubuntu / RHEL / Rocky
NVIDIA Vera CPU
- 88 Olympus Cores
- ARMv9.2
- FP8 precision
- Spatial Multithreading - enables 176 total threads through resource partitioning (in contrast to time slicing)
- 1.2 TB/s memory bandwidth
- Up to 1.5 TB of LPDDR5X memory
- NVIDIA SCF (across 88 cores on single die)
- NVIDIA NVLink-C2C - 1.8 TB/s - unified memory of CPU and GPU
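The thread count implied by spatial multithreading is just two resource partitions per core across all 88 Olympus cores:

```python
# Vera CPU: spatial multithreading partitions each core's resources in two,
# so total hardware threads = cores x partitions per core.
olympus_cores = 88
threads_per_core = 2
print(olympus_cores * threads_per_core)  # 176
```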
Vera Rubin NVL72
- 72 Rubin GPUs
- 36 Vera CPUs
- ConnectX-9 SuperNICs
- BlueField-4 DPUs
- NVLink 6 Switch
- NVIDIA Quantum-X800 InfiniBand
- Spectrum-X Ethernet
Rubin GPU
- compression to boost NVFP4
- 50 petaFLOPS of NVFP4
- compatible with blackwell
- Security ("Confidential Computing")
- 3rd gen "trusted execution environment" across Vera, Rubin, and NVLink.
- Attestation services with cryptographic proof of compliance
- NVLink 6 Switch
- 3.6 TB/s bandwidth per GPU
- 260 TB/s of connectivity
- NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) - reduces network congestion by 50%.
- 2nd Gen Reliability, Availability, and Serviceability (RAS) Engine
- ... health checks and monitoring
- SOCAMM LPDDR5X
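The NVLink 6 connectivity figure is consistent with the per-GPU bandwidth times the 72 GPUs of an NVL72 rack; the only assumption is that "260 TB/s" is the rounded rack aggregate:

```python
# NVLink 6 cross-check: 72 GPUs x 3.6 TB/s each ~= 260 TB/s rack aggregate.
gpus = 72
tbs_per_gpu = 3.6
print(gpus * tbs_per_gpu)  # 259.2, i.e. ~260 TB/s
```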
Platforms:
- DGX - Turnkey AI supercomputing
- HGX - High-density GPU building blocks
- EGX - Edge/IoT AI
- OVX - Omniverse Enterprise workloads and sims (3D focus)
- AGX - Autonomous machines and robotics
- IGX - Industrial and ruggedized AI deployments
- NGC - NVIDIA GPU Cloud
- JNX / Jetson - Edge AI modules and dev kits
- Drive / DDX / ADX - Automotive, self-driving
- RTX / Quadro / T - GPUs for workstations, 3D graphics, rendering
- Maxine - Cloud AI SDKs for audio/video (virtual conferencing)
- Mellanox - InfiniBand, NVSwitch
- CUDA - GPU programming platform
- TensorRT - Optimized inference runtime
- Isaac - Robotics
- Omniverse - Simulation
- Clara - Healthcare
- Metropolis - Smart cities
- Note: For XGX names, the "X" tells you the domain (D=Data center, H=Hyperscale, E=Edge, A=Autonomous, I=Industrial, O=Omniverse).
- Note: Other names (Jetson, Drive, Clara) are platforms or SDKs for domain-specific hardware or software.
Components:
- Grace - ARM-based server CPU (Neoverse V2 Cores)
- Grace CPU Superchip - Two Grace CPUs linked with NVLink-C2C
- Hopper - GPU Arch (previous data-center generation) (e.g. GH200)
- Blackwell - GPU Arch (for data centers) - basis of B200
- Blackwell Ultra - GPU Arch - basis for B300
- Rubin - GPU Arch - successor to Blackwell (e.g. VR200).
- Rubin Ultra - GPU Arch - Multi-die Rubin GPU
- Vera - Arm-based CPU for integration with Rubin
- Olympus - CPU core architecture - Cores in Vera CPUs
- NVLink - Interconnect between CPU and GPU
- ConnectX-9 / BlueField - Network / DPU / NIC
- SpectrumX - Switch/Switch ASIC
NVIDIA Timeline:
- 1997-2000 - NV3, NV4, NV10 (GPU Arch) - Initial Fixed-Function Graphics
- 1999 - Celsius (GPU Arch) - Direct3D 7 support.
- 2001 - Kelvin aka Geforce3/NV20 (GPU Arch)
- 2002 - Rankine aka Geforce4/NV25/NV28 (GPU Arch)
- 2004 - Curie aka Geforce6&7/NV40 (GPU Arch)
- 2006 - Tesla Geforce8 (GPU Arch) - First unified shader arch.
- 2008 - Tesla Geforce9 (GPU Arch)
- 2008 - Tegra (Denver) (CPU Core)
- 2009 - Tesla Geforce 100 / 200 / 300 (GPU Arch)
- 2010 - Fermi Geforce 400 / 500 (GPU Arch) - HPC-ready
- 2012 - Kepler Geforce 600 / 700 (GPU Arch) - Compute focused
- 2014 - Maxwell Geforce 800 / 900 (GPU Arch) - Energy improvements
- 2016 - Pascal "Geforce 10" (GPU Arch) - AI inf foundations
- 2017 - Volta (GPU Arch)
- 2018 - Xavier (CPU Core) - Robotics/AV for Jetson AGX Xavier
- 2018 - Turing "Geforce 20" (GPU Arch) - Tensor cores + ray tracing
- 2020 - Ampere "Geforce 30" (GPU Arch)
- 2021 - Grace (CPU Arch)
- 2022 - Orin (CPU Core) - Automotive/Edge w/ Jetson/DRIVE
- 2022 - Ada Lovelace "Geforce 40" (GPU Arch) - DLSS3
- 2022 - Grace Superchip (CPU Arch)
- 2022 - Hopper (GPU Arch)
- 2024 - Blackwell (GPU Arch)
- 2025 - Blackwell Ultra (GPU Arch)
- 2025 - Blackwell "Geforce 50" (GPU Arch)
- 2026 - Rubin (GPU Arch)
- 2026 - Vera (CPU Arch)
- 2026? - N1 / N1X (CPU Arch)
- 2027 - Rubin Ultra (GPU Arch)
- 2028 - Feynman (GPU Arch)
Geforce 50 Blackwell
- Mem: GDDR6X / GDDR7
- Bus: 384-512 bit bus
- ECC: None
- Power: ~400W
- Limited NVLink
Data Center Blackwell
- Mem: HBM3 / HBM3e
- Bus: 4096 bit bus
- ECC: Full ECC
- Power: ~700W per GPU
- NVLink/NVSwitch (Interconnect), MIG (Virtualization)
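The bus-width difference above is most of the bandwidth story. Peak memory bandwidth is roughly bus width (bits) / 8 × per-pin data rate (Gb/s); the per-pin rates below are illustrative assumptions, not quoted specs:

```python
# Rough peak memory bandwidth (GB/s) = bus_bits / 8 * gbps_per_pin.
def peak_bw_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits / 8 * gbps_per_pin

print(peak_bw_gbs(384, 28))   # GDDR7 @ ~28 Gb/s/pin on a 384-bit bus -> 1344.0 GB/s
print(peak_bw_gbs(4096, 8))   # HBM3(e) @ ~8 Gb/s/pin on a 4096-bit bus -> 4096.0 GB/s
```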
Alibaba
Alibaba Details
- 2018 - T-Head Semiconductor founded
- 2019 - Hanguang 800
- Xuantie CPU family - RISC-V architecture
- 2019 - Xuantie 910 - 64-bit - general-purpose CPU
- 2022 - Xuantie 920 - 64-bit - server acceleration
- 2023 - Xuantie 930 - 64-bit - very low power
- 2020 - Xuantie E&C - 32-bit - e.g. Xuantie E902 for IoT
- PPU (Processing Personal Unit) - reportedly on par with the NVIDIA H20 and Huawei Ascend 910B
- 96GB HBM2e memory
- Interconnect (700GB/s)
- PCIe support
- 2026 - Zhenwu 810E - AI accelerator (between NVIDIA A800 and H20)
ARM
ARM Details
Immortalis GPU Family
- G715 - ray tracing, shading
- G720 - Deferred Vertex Shading, better memory and power
- G925 - more shaders, better AI/ML support
Timeline
- Utgard / Midgard
- 2008 - Mali-200 - basic GPU, OGL ES 2.0
- 2010 - Mali-400 - multi-core, OGL ES 2.0
- 2011 - Mali-T604/T624/T628 - unified shaders, OGL ES 3.0
- 2013 - Mali-T760/T820/T880 - OGL ES 3.1
- Bifrost / Valhall
- 2017 - Mali-G71/G72/G76/G77 - ai support
- 2022 - Immortalis‑G715 - ray tracing, variable rate shading, AI/ML
- 2023 - Immortalis‑G720 - Deferred Vertex Shading, improved memory and power
- 2025 - Immortalis‑G925 - 24 shader cores
Huawei
Huawei Details
Hardware
- "Standard ARM-licensed GPU cores"
- 2014 - Kirin 910/910T - Quad-core Mali-450 MP4
- 2014 - Kirin 920 - Octa-core Mali-T628 MP4
- 2015 - Kirin 930/935 - Octa-core - Mali-T628 MP4
- 2015 - Kirin 950 - Quad Cortex-A72+A53 - Mali-T880 MP4
- 2016 - Kirin 955 / 960 - Quad Cortex-A73+A53 Mali-G71 MP8
- 2017 - Kirin 970 - Octa-core Cortex-A73+A53 Mali-G72 MP12 + NPU
- 2018 - Kirin 980 - Octa-core Cortex-A76+A55 Mali-G76 MP10 + NPU
- 2018 - Kirin 710/710A Octa-core Cortex-A73+A53 Mali-G51 MP4
- 2019 - Kirin 810 - Octa-core Cortex-A76+A55 Mali-G52 MP6 + Mini NPU
- 2019 - Kirin 990 4G - Octa-core Cortex-A76+A55 Mali-G76 MP16 + Dual NPU
- 2019 - Kirin 990 5G - Octa-core Cortex-A76+A55, Mali-G76 MP16 + Dual NPU
- 2020 - Kirin 985 - Octa-core Cortex-A76+A55 Mali-G77 MP10 + NPU
- 2020 - Kirin 820/820E - Octa-core Cortex-A77+A55 Mali-G57 MP6 + Mini NPU
- 2020 - Kirin 9000/9000E - Octa-core-A77+A55 Mali-G78 MP24/MP22 + Dual NPU
- 2023 - Kirin 9000s - Octa-core, Maleoon 910 GPU + Dual NPU
- AI Accelerator Arch:
- Da Vinci (NPU Arch) - Used in Ascend AI/NPU chips
- Ascend Configs
- 2018 - Ascend 310 - Ultra-low power
- 2019 - Ascend 910 - Large scale.
- 2022 - Ascend 910B - (Domestic?) Variant of 910
- 2025 - Ascend 910C - Multi-910 chips
- 2026? - Ascend 910D - Quad-die successor?
- 2026-Q1 - Ascend 950PR - high-bandwidth memory
- 2026-Q4 - Ascend 950DT - decode inf + trn
- 2027-Q4 - Ascend 960 - double 950
- 2028-Q4 - Ascend 970 - double 960
Platforms:
- Atlas - full-stack AI solution
- 2019 - Atlas 200 - Ascend 310 Edge
- 2019 - Atlas 300 - Ascend 310 Edge
- 2019 - Atlas 500 - Ascend 310 Industrial AI
- 2020 - Atlas 800 - Ascend 910 / 910B
- 2025 - Atlas 900 A3 SuperPoD - Ascend 910C
- 2026 - Atlas 950 SuperPoD - Ascend 950PR / 950DT
- 2027 - Atlas 960 SuperPoD - Ascend 960
- 2028 - Atlas 970 - Ascend 970
- MindSpore - (Official) AI framework that leverages Ascend NPUs
- CANN - Compute Architecture for Neural Networks - optimized runtime and libraries that let applications target Ascend NPUs
- FusionServer Pro AI
- 2015 - Early FusionServer (traditional server)
- 2017 - V5 - HPC focused
- 2019 - Pro - Pro added to indicate AI focus
- 2020 - V6 - Xeon 3rd Gen + AI
- Huawei Cloud AI Compute Nodes
- Atlas AI Rack Solutions
Xiaomi
Xiaomi Details
- XRING O1
- multi-core cluster
- GPU: Immortalis G925 mp16
- NPU: 6-core AI accelerator (44 TF)
AI Frameworks
- TensorFlow by Google (GitHub)
- PaddlePaddle by Baidu (GitHub) - Mature
- Note: ONNX interoperability is common.
- Note: CUDA is dominant outside of Huawei.
AI Runtimes Notes
- ONNX Runtime by Microsoft
- TensorRT by NVIDIA
- CUDA Runtime by NVIDIA
- App -> TensorRT/ONNX -> cuLIB -> CUDA RT -> CUDA Drv -> GPU
- ROCm by AMD
- App -> ONNX/PyTorch/vLLM -> MIOpen/rocLIB -> HIP Runtime -> HSA Runtime -> AMDGPU Drv -> HSACO/GPU
- HIP - Heterogeneous Interface for Portability
- HSA - Heterogeneous System Architecture
- HSACO - ELF-based HSA code objects
- Instead of PTX/CUBIN, LLVM based
- HIP -> LLVM IR -> HSACO -> GPU
- Allows support for GCN, RDNA, and CDNA archs
- Metal by Apple
- Stack:
- App
- CoreML/MPSGraph/TF-Metal/PT-Metal
- Metal-API/MPS (Metal Shading Language)
- Apple GPU Runtime
- GPU (A-Series/M-series)
- Instead of PTX/CUBIN, LLVM based
- Core-ML -> MSL -> LLVM IR -> GPU ISA -> GPU
- Allows support for A/M series GPUs
- Format: .metallib
- TorchScript by PyTorch
- Format: ZIP, Pickle, Python
- Tensorflow Lite by Google
- Format: Flatbuffers
- OpenVINO by Intel
- Stack:
- ONNX/TF/PyTorch
- OpenVINO Model Optimizer
- IR (.xml + .bin)
- OpenVINO Runtime (CPU, GPU, VPU, FPGA)
- Optimized kernels executed on hardware
- Dependencies:
- Intel's Graphics Compute Runtime (OpenCL / Level Zero)
- OpenCL
- LLVM under the hood
- TVM by Apache
- MLIR by Google
- vLLM - OSS
- llama.cpp - OSS
- MindSpore Lite by Huawei - Uses MindIR format (in contrast to tflite's Flatbuffers)
- CANN by Huawei - Compute Architecture for Neural Networks (similar to CUDA)
- App -> MindSpore Lite -> CANN Lib -> CANN RT -> Ascend
- Paddle Inference by Baidu
- Tengine - OSS
- MNN by Alibaba
- BladeDISC by Alibaba
- MegEngine Runtime by Megvii