2026-02-23

Discoveries

AI Hardware Research

NVIDIA

NVIDIA Details

DGX Platform - Unified AI development solution for enterprise.

  • DGX SuperPOD - Full data-center-scale AI infrastructure: DGX systems, networking, storage, and software in one package.
  • DGX Vera Rubin NVL72 Systems - Liquid-cooled rack-scale systems.
  • DGX Rubin NVL8 - Turnkey platform.
    • GPUs: 8x Rubin GPUs
    • Memory / Bandwidth: 2.3 TB / 160 TB/s
    • Perf: NVFP4 Inf: 400 PF, NVFP4 Trn: 280 PF, FP8/FP6 Trn: 140 PF
    • CPU: 2x Intel Xeon 6776P
    • NVLink Switch / Bandwidth: 4x / 28.8 TB/s
    • Power: ~24kW
    • Networking:
      • 8x OSFP (NVIDIA ConnectX-9 VPI) ports
        • Max 800 Gb/s InfiniBand or Ethernet
    • SW: NVIDIA DGX OS, Ubuntu, Red Hat Enterprise Linux, Rocky
  • DGX GB300 Systems - Grace Blackwell Ultra Superchips.
    • GPUs: 72x Blackwell Ultra GPUs, 36x Grace CPUs
    • GPU Mem / Bandwidth: 20 TB / 576 TB/s
    • CPU Cores: 2592x Arm Neoverse V2 Cores
    • Fast Mem: 37 TB
    • Perf: FP4: 1440 PF/1080 PF*, FP8/FP6: 720 PF
      • *?
    • Net: 72x (ConnectX-8 VPI) OSFP ports (800 Gb/s InfiniBand)
    • NVLink: 9x L1 NVIDIA Switches
    • Mgmt: BMC with RJ45
    • SW: Mission Control, AI Enterprise, DGX OS / Ubuntu
  • DGX B300 Systems - Blackwell Ultra
    • GPU: 8x Blackwell Ultra SXM
      • GPU Mem: 2.1 TB
    • CPU: Intel Xeon 6776P Processors
    • Perf: FP4 - 144 PF | 108 PF, FP8: 72 PF
    • NVSwitch: 2x
    • NVLink Bandwidth: 14.4 TB/s
    • Net: 8x (ConnectX-8 VPI) OSFP ports (800 Gb/s InfiniBand/Ethernet), 2x QSFP112 BlueField-3 DPU
    • Mgmt: 1GbE NIC with RJ45 and BMC
    • Storage: OS - 1.9TB NVMe M.2, Internal - 8x 3.84 TB NVMe E1.S
    • Power: ~14kW
    • Software:
      • AI Enterprise
      • Mission Control
      • Run:ai technology
      • DGX OS, RHEL, Rocky, Ubuntu
    • Rack Units: 10U
  • DGX GB200 Systems - Large-scale Blackwell
    • GPU: 72x Blackwell GPUs, 36x Grace CPUs
      • 13.4 TB HBM3e | 576 TB/s
    • CPU Cores: 2592 Arm Neoverse V2 Cores
    • Fast Mem: 30.2 TB
    • Perf: FP4: 1440 PF | 720 PF, FP8/FP6: 720 PF | 360 PF
    • Interconnect: 72x (ConnectX-7 VPI) OSFP ports (400 Gb/s InfiniBand), 36x dual-port BlueField-3 VPI with 200 Gb/s InfiniBand/Ethernet
    • NVLink Switch: 9x L1 NVLink Switches
    • Mgmt: BMC with RJ45
    • SW: Mission Control, AI Enterprise, DGX OS / Ubuntu
  • DGX B200 Systems - Blackwell
    • GPU: 8x Blackwell GPUs
      • Mem: 1440 GB, 64 TB/s HBM3e bandwidth
    • Perf: FP4: 144PF | 72 PF, FP8: 72 PF
    • NVSwitch: 2x
    • NVLink Bandwidth: 14.4 TB/s
    • Power: 14.3 kW
    • CPU: 2x Intel Xeon Platinum 8570, 112 cores (2.1 GHz / 4.0 GHz)
    • SysMem: 2-4 TB
    • Net: 4x (ConnectX-7 VPI) OSFP ports (400 Gb/s), 2x dual-port QSFP112 BlueField-3 DPU (400 Gb/s)
    • Mgmt: 10 GbE onboard NIC (RJ45), 100 GbE dual-port NIC, BMC with RJ45
    • Storage: 2x 1.9 TB NVMe M.2, 8x 3.84 TB NVMe U.2
    • Software:
      • AI Enterprise - Optimized AI software
      • Mission Control - Ops and Orch with Run:ai tech
      • DGX OS / Ubuntu - OS
    • Rack Units: 10U
  • DGX H200 Systems - Hopper-based
    • GPU: 8x H200 (Tensor Core) GPUs
      • Mem: 1128 GB
    • Perf: FP8 - 32 PF
    • NVSwitch: 4x
    • Power: 10.2 kW
    • CPU: 2x Intel Xeon Platinum 8480C, 112 cores, 2.0GHz/3.8GHz
    • SysMem: 2TB
    • Net: 4x (ConnectX-7 VPI) OSFP (400 Gb/s InfiniBand/Eth)
    • Mgmt: BMC via RJ45
    • Storage: 2x 1.9 TB NVMe M.2, Int: 8x 3.84 TB NVMe U.2
    • Software:
      • AI Enterprise
      • Base Command
      • DGX OS / Ubuntu / RHEL / Rocky

NVIDIA Vera CPU

  • 88 Olympus Cores
  • ARMv9.2
  • FP8 precision
  • Spatial Multithreading - enables 176 total threads through resource partitioning (in contrast to time slicing)
  • 1.2 TB/s memory bandwidth
  • Up to 1.5 TB of LPDDR5X memory
  • NVIDIA SCF (across 88 cores on single die)
  • NVIDIA NVLink-C2C - 1.8 TB/s - unified memory of CPU and GPU

Vera Rubin NVL72

  • 72 Rubin GPUs
  • 36 Vera CPUs
  • ConnectX-9 SuperNICs
  • BlueField-4 DPUs
  • NVLink 6 Switch
  • NVIDIA Quantum-X800 InfiniBand
  • Spectrum-X Ethernet

Rubin GPU

  • Compression to boost NVFP4
  • 50 petaFLOPS of NVFP4
  • Compatible with Blackwell
  • Security ("Confidential Computing")
    • 3rd gen "trusted execution environment" across Vera, Rubin, and NVLink.
    • Attestation services with cryptographic proof of compliance
  • NVLink 6 Switch
    • 3.6 TB/s bandwidth per GPU
    • 260 TB/s of connectivity
    • NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) - reduces network congestion by 50%.
  • 2nd Gen Reliability, Availability, and Serviceability (RAS) Engine
    • ... health checks and monitoring
  • SOCAMM LPDDR5X

Platforms:

  • DGX - Turnkey AI supercomputing

  • HGX - High-density GPU building blocks

  • EGX - Edge/IoT AI

  • OVX - Omniverse Enterprise Workloads and Sims (3D focus)

  • AGX - Autonomous machines and robotics

  • IGX - Industrial and ruggedized AI deployments

  • NGC - NVIDIA GPU Cloud

  • JNX / Jetson - Edge AI modules and dev kits

  • Drive / DDX / ADX - Automotive, self-driving

  • RTX / Quadro / T - GPUs for workstations, 3D graphics, rendering

  • Maxine - Cloud AI SDKs for audio/video (virtual conferencing)

  • Mellanox - InfiniBand and Ethernet networking (NICs, switches)

  • CUDA - GPU Programming platform

  • TensorRT - Optimized inference runtime

  • Isaac - robotics

  • Omniverse - simulation

  • Clara - healthcare

  • Metropolis - smart cities

  • Note: XGX → “X” tells you the domain (D=Data center, H=Hyperscale, E=Edge, A=Autonomous, I=Industrial, O=Omniverse).

  • Note: Other names (Jetson, Drive, Clara) → platforms or SDKs for domain-specific hardware or software.

Components:

  • Grace - ARM-based server CPU (Neoverse V2 Cores)
  • Grace CPU Superchip - Two Grace CPUs linked with NVLink-C2C
  • Hopper - GPU Arch (previous data-center generation) (e.g. GH200)
  • Blackwell - GPU Arch (for data centers) - basis of B200
  • Blackwell Ultra - GPU Arch - basis for B300
  • Rubin - GPU Arch - successor to Blackwell (e.g. VR200).
  • Rubin Ultra - GPU Arch - Multi-die Rubin GPU
  • Vera - Arm-based CPU for integration with Rubin
  • Olympus - CPU core architecture - Cores in Vera CPUs
  • NVLink - High-speed GPU-to-GPU interconnect (CPU-to-GPU via NVLink-C2C)
  • ConnectX-9 / BlueField - Network / DPU / NIC
  • Spectrum-X - Ethernet switch platform / switch ASIC

NVIDIA Timeline:

  • 1997-2000 - NV3, NV4, NV10 (GPU Arch) - Initial Fixed-Function Graphics
  • 1999 - Celsius (GPU Arch) - Direct3D 7 support.
  • 2001 - Kelvin aka Geforce3&4/NV20/NV25 (GPU Arch)
  • 2003 - Rankine aka GeforceFX/NV30 (GPU Arch)
  • 2004 - Curie aka Geforce6&7/NV40 (GPU Arch)
  • 2006 - Tesla Geforce8 (GPU Arch) - First unified shader arch.
  • 2008 - Tesla Geforce9 (GPU Arch)
  • 2008 - Tegra (Denver) (CPU Core)
  • 2009 - Tesla Geforce 100 / 200 / 300 (GPU Arch)
  • 2010 - Fermi Geforce 400 / 500 (GPU Arch) - HPC-ready
  • 2012 - Kepler Geforce 600 / 700 (GPU Arch) - Compute focused
  • 2014 - Maxwell Geforce 800 / 900 (GPU Arch) - Energy improvements
  • 2016 - Pascal "Geforce 10" (GPU Arch) - AI inf foundations
  • 2017 - Volta (GPU Arch)
  • 2018 - Xavier (CPU Core) - Robotics/AV for Jetson AGX Xavier
  • 2018 - Turing "Geforce 20" (GPU Arch) - Tensor cores + ray tracing
  • 2020 - Ampere "Geforce 30" (GPU Arch)
  • 2021 - Grace (CPU Arch)
  • 2022 - Orin (CPU Core) - Automotive/Edge w/ Jetson/DRIVE
  • 2022 - Ada Lovelace "Geforce 40" (GPU Arch) - DLSS3
  • 2022 - Grace Superchip (CPU Arch)
  • 2022 - Hopper (GPU Arch)
  • 2024 - Blackwell (GPU Arch)
  • 2025 - Blackwell Ultra (GPU Arch)
  • 2025 - Blackwell "Geforce 50" (GPU Arch)
  • 2026 - Rubin (GPU Arch)
  • 2026 - Vera (CPU Arch)
  • 2026? - N1 / N1X (CPU Arch)
  • 2027 - Rubin Ultra (GPU Arch)
  • 2028 - Feynman (GPU Arch)

Geforce 50 Blackwell

  • Mem: GDDR7 (GDDR6 on some low-end parts)
  • Bus: Up to 512-bit
  • ECC: None
  • Power: ~400W
  • Limited NVLink

Data Center Blackwell

  • Mem: HBM3 / HBM3e
  • Bus: 4096 bit bus
  • ECC: Full ECC
  • Power: ~700W per GPU
  • NVLink/NVSwitch (Interconnect), MIG (Virtualization)
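
You can see this consumer/data-center split from software by querying NVML. A minimal sketch, assuming the nvidia-ml-py (pynvml) bindings and at least one NVIDIA GPU; ECC and MIG queries typically raise NotSupported on GeForce parts:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(pynvml.nvmlDeviceGetName(handle))

# ECC and MIG are data-center features; consumer (GeForce) parts
# usually raise NVMLError_NotSupported for these queries.
try:
    current, pending = pynvml.nvmlDeviceGetEccMode(handle)
    print('ECC enabled:', bool(current))
except pynvml.NVMLError as err:
    print('ECC query failed:', err)

try:
    mig_current, mig_pending = pynvml.nvmlDeviceGetMigMode(handle)
    print('MIG enabled:', bool(mig_current))
except pynvml.NVMLError as err:
    print('MIG query failed:', err)

pynvml.nvmlShutdown()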

Alibaba

Alibaba Details
  • 2018 - T-Head Semiconductor founded

  • 2019 - Hanguang 800 - AI inference accelerator

  • Xuantie CPU family - RISC-V architecture

    • 2019 - Xuantie 910 - 64bit - general purpose CPU
    • 2022 - Xuantie 920 - 64bit - server acceleration
    • 2023 - Xuantie 930 - 64bit - very low power
    • 2020 - Xuantie E&C - 32bit - Xuantie E902 for IoT
  • PPU (Processing Personal Unit) - reportedly on par with Nvidia H20 and Huawei Ascend 910B

    • 96GB HBM2e memory
    • Interconnect (700GB/s)
    • PCIe support
    • 2026 - Zhenwu 810E - AI Accelerator (between Nvidia A800 and H20)

ARM

ARM Details

Immortalis GPU Family

  • G715 - hardware ray tracing, variable rate shading
  • G720 - Deferred Vertex Shading, improved memory and power efficiency
  • G925 - more shader cores, better AI/ML support

Timeline

  • Utgard
    • 2008 - Mali-200 - basic GPU, OGL ES 2.0
    • 2010 - Mali-400 - multi-core, OGL ES 2.0
  • Midgard
    • 2011 - Mali-T604/T624/T628 - unified shaders, OGL ES 3.0
    • 2013 - Mali-T760/T820/T880 - OGL ES 3.1
  • Bifrost
    • 2016-2018 - Mali-G71/G72/G76 - AI support
  • Valhall
    • 2019 - Mali-G77
    • 2022 - Immortalis-G715 - ray tracing, var shading, AI/ML
    • 2023 - Immortalis-G720 - Deferred Vertex Shading, improved mem and pwr
    • 2025 - Immortalis-G925 - 24 shader cores

Huawei

Huawei Details

Hardware

  • "Standard GPU ARM licensed cores"

    • 2014 - Kirin 910/910T - Quad-core Mali-450 MP4
    • 2014 - Kirin 920 - Octa-core Mali-T628 MP4
    • 2015 - Kirin 930/935 - Octa-core - Mali-T628 MP4
    • 2015 - Kirin 950 - Quad Cortex-A72+A53 - Mali-T880 MP4
    • 2016 - Kirin 955 / 960 - Quad Cortex-A73+A53 Mali-G71 MP8
    • 2017 - Kirin 970 - Octa-core Cortex-A73+A53 Mali-G72 MP12 + NPU
    • 2018 - Kirin 980 - Octa-core Cortex-A76+A55 Mali-G76 MP10 + NPU
    • 2018 - Kirin 710/710A Octa-core Cortex-A73+A53 Mali-G51 MP4
    • 2019 - Kirin 810 - Octa-core Cortex-A76+A55 Mali-G52 MP6 + Mini NPU
    • 2019 - Kirin 990 4G - Octa-core Cortex-A76+A55 Mali-G76 MP16 + Dual NPU
    • 2019 - Kirin 990 5G - Octa-core Cortex-A76+A55, Mali-G76 MP16 + Dual NPU
    • 2020 - Kirin 985 - Octa-core Cortex-A76+A55 Mali-G77 MP10 + NPU
    • 2020 - Kirin 820/820E - Octa-core Cortex-A77+A55 Mali-G57 MP6 + Mini NPU
    • 2020 - Kirin 9000/9000E - Octa-core-A77+A55 Mali-G78 MP24/MP22 + Dual NPU
    • 2023 - Kirin 9000s - Octa-core + Dual NPU (in-house Maleoon 910 GPU)
  • AI Accelerator Arch:

    • Da Vinci (NPU Arch) - Used in Ascend AI/NPU chips
    • Ascend Configs
      • 2018 - Ascend 310 - Ultra-low power
      • 2019 - Ascend 910 - Large-scale training
      • 2022 - Ascend 910B - (Domestic?) Variant of 910
      • 2025 - Ascend 910C - Multi-910 chips
      • 2026? - Ascend 910D - Quad-die successor?
      • 2026-Q1 - Ascend 950PR - high-bandwidth memory
      • 2026-Q4 - Ascend 950DT - decode inf + trn
      • 2027-Q4 - Ascend 960 - double 950
      • 2028-Q4 - Ascend 970 - double 960

Platforms:

  • Atlas - full-stack AI solution
    • 2019 - Atlas 200 - Ascend 310 Edge
    • 2019 - Atlas 300 - Ascend 310 Edge
    • 2019 - Atlas 500 - Ascend 310 Industrial AI
    • 2020 - Atlas 800 - Ascend 910 / 910B
    • 2025 - Atlas 900 A3 SuperPoD - Ascend 910C
    • 2026 - Atlas 950 SuperPoD - Ascend 950PR / 950DT
    • 2027 - Atlas 960 SuperPoD - Ascend 960
    • 2028 - Atlas 970 - Ascend 970
  • MindSpore - (Official) AI framework that leverages Ascend NPUs
  • CANN - Compute Architecture for Neural Networks - Optimized runtime and libraries that allow applications to target Ascend NPUs
  • FusionServer Pro AI
    • 2015 - Early FusionServer (traditional server)
    • 2017 - V5 - HPC focused
    • 2019 - Pro - Pro added to indicate AI focus
    • 2020 - V6 - Xeon 3rd Gen + AI
  • Huawei Cloud AI Compute Nodes
  • Atlas AI Rack Solutions

Xiaomi

Xiaomi Details
  • XRING O1
    • multi-core cluster
    • GPU: Immortalis G925 mp16
    • NPU: 6-core AI accelerator (44 TOPS)

AI Frameworks

AI Runtimes Notes

  • ONNX Runtime by Microsoft
  • TensorRT by NVIDIA
  • CUDA Runtime by NVIDIA
    • App -> TensorRT/ONNX -> CUDA libs (cuBLAS/cuDNN) -> CUDA RT -> CUDA Drv -> GPU (see the provider-selection sketch after this list)
  • ROCm by AMD
    • App -> ONNX/PyTorch/vLLM -> MIOpen/rocLIB -> HIP Runtime -> HSA Runtime -> AMDGPU Drv -> HSACO/GPU
    • HIP - Heterogeneous Interface for Portability
    • HSA - Heterogeneous System Architecture
    • HSACO - ELF-based HSA code objects
    • Instead of PTX/CUBIN, LLVM-based
      • HIP -> LLVM IR -> HSACO -> GPU
      • Allows support for GCN, RDNA, and CDNA archs
  • Metal by Apple
    • Stack:
      • App
      • CoreML/MPSGraph/TF-Metal/PT-Metal
      • Metal API / MPS (Metal Performance Shaders)
      • Apple GPU Runtime
      • GPU (A-Series/M-series)
    • Instead of PTX/CUBIN, LLVM-based
      • Core-ML -> MSL -> LLVM IR -> GPU ISA -> GPU
      • Allows support for A/M series GPUs
    • Format is a .metallib
  • TorchScript by PyTorch
    • Format: ZIP, Pickle, Python
  • Tensorflow Lite by Google
    • Format: Flatbuffers
  • OpenVINO by Intel
    • Stack:
      • ONNX/TF/PyTorch
      • OpenVINO Model Optimizer
      • IR (.xml + .bin)
      • OpenVINO Runtime (CPU, GPU, VPU, FPGA)
      • Optimized kernels executed on hardware
    • Dependencies:
      • Intel's Graphics Compute Runtime (OpenCL / Level Zero)
      • OpenCL
      • LLVM under the hood
  • TVM by Apache
  • MLIR by Google
  • vLLM - OSS
  • llama.cpp - OSS
  • MindSpore Lite by Huawei - Uses MindIR format (in contrast to tflite's Flatbuffers)
  • CANN by Huawei - Compute Architecture for Neural Networks (similar to CUDA)
    • App -> MindSpore Lite -> CANN Lib -> CANN RT -> Ascend
  • Paddle Inference by Baidu
  • Tengine - OSS
  • MNN by Alibaba
  • BladeDISC by Alibaba
  • MegEngine Runtime by Megvii
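
At the application level, several of these stacks reduce to provider selection. A minimal sketch with ONNX Runtime (model.onnx is a placeholder; assumes the onnxruntime-gpu package), where the provider list decides whether execution rides the TensorRT, CUDA, or CPU path:

import onnxruntime as ort

# Providers are tried in order; ONNX Runtime falls back to the next
# provider for any op the earlier one can't handle.
providers = [
    'TensorrtExecutionProvider',  # TensorRT path
    'CUDAExecutionProvider',      # cuDNN/cuBLAS over the CUDA runtime
    'CPUExecutionProvider',       # always available
]

sess = ort.InferenceSession('model.onnx', providers=providers)
print('Active providers:', sess.get_providers())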

2026-02-18

Setup Kata Containers

Download the static release for your system; the tarball expands everything to ./opt/kata. I recommend moving the kata folder to /opt. Then create symlinks:

sudo ln -sf /opt/kata/bin/kata-runtime /usr/local/bin/kata-runtime
sudo ln -sf /opt/kata/bin/containerd-shim-kata-v2 /usr/local/bin/containerd-shim-kata-v2
sudo ln -s /opt/kata/share/defaults/kata-containers /etc/kata-containers

Run kata-runtime check to load and verify everything:

# Initially I had to run
sudo kata-runtime check
# Then I could run
kata-runtime check

# Optionally
kata-runtime version
kata-runtime env

Presuming your check returns "System is capable of running Kata Containers", you can run the following docker command:

sudo docker run --rm -ti --runtime io.containerd.run.kata.v2 ubuntu /bin/bash

That command will do the usual downloading of an ubuntu:latest image if one is not already cached on the system. The --rm -ti is for cleanup on exit and attaching to STDIO. Finally, --runtime selects the runtime to use. To docker, this is simply a string. The string is then passed to containerd, where it's chopped up and converted into containerd-shim-kata-v2. By default, docker uses io.containerd.run.runc.v2.

If all goes well, the above command will drop you into a bash prompt similar to:

root@50f07071b45b:/#

At this point, you are running in a QEMU-isolated environment with its own guest kernel.

Verifying Isolation

Try these on host, runc, and kata to see the differences:

  • uname -a
  • cat /proc/cmdline
  • mount
  • dmesg - Note: works in kata, not in runc

Rerun the Kata container in privileged mode (--privileged) and it still has no access to the host kernel; the elevated privileges apply only to the guest VM's kernel.

2026-02-06

neat-ly organize multiline strings for print()

def neat(s, indent=''):
    min_spaces = 1000
    lines = s.splitlines()

    # Drop the first line if it is blank (the text right after ''').
    if len(lines) and len(lines[0].strip()) == 0:
        lines.pop(0)
    # Drop the last line if it is blank (the text right before ''').
    if len(lines) and len(lines[-1].strip()) == 0:
        lines.pop(-1)

    # Find the smallest indent across the non-blank lines.
    for line in lines:
        space_cnt = len(line) - len(line.lstrip(' '))
        if len(line.strip()) and space_cnt < min_spaces:
            min_spaces = space_cnt

    # Remove the common indent and apply the requested one.
    # (Iterate the trimmed list, not s.splitlines(), so the dropped
    # blank lines stay dropped.)
    new_lines = []
    for line in lines:
        new_lines.append(indent + line[min_spaces:])

    return '\n'.join(new_lines)

def info_yolo5(args):
    print(neat('''
        Some text that I want to be clearly indented
        with the code. Here is a list:
        - Item 1
        - Item 2
          + Another sublist
        ''', indent=' '))

Output:

 Some text that I want to be clearly indented
 with the code. Here is a list:
 - Item 1
 - Item 2
   + Another sublist
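
Most of this can also be had from the standard library. A minimal sketch using textwrap (neat_tw is a hypothetical name; the blank-line trimming differs slightly from neat() above):

import textwrap

def neat_tw(s, indent=''):
    # dedent() strips the common leading whitespace, strip('\n') drops
    # the blank first/last lines, and indent() re-applies the prefix.
    return textwrap.indent(textwrap.dedent(s).strip('\n'), indent)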

Yolo Generation via Ultralytics Docker

From outside container:

docker pull ultralytics/ultralytics:8.4.8-python-export

docker run -it --rm \
-v $(pwd):/workspace \
ultralytics/ultralytics:8.4.8-python-export \
bash

From inside container:

cd /workspace
# Dump a tflite (implicitly based on onnx opset 22)
yolo export model=/workspace/yolov5su.pt format=tflite
# Explicitly dump onnx with opset 14.
yolo export model=yolov5su.pt format=onnx opset=14
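
To sanity-check the exported ONNX file, a minimal sketch assuming onnxruntime is installed (yolov5su.onnx is the file the export above should produce next to the .pt):

import onnxruntime as ort

sess = ort.InferenceSession('yolov5su.onnx', providers=['CPUExecutionProvider'])
for inp in sess.get_inputs():
    print('input:', inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print('output:', out.name, out.shape, out.type)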

2026-02-02

  • Wrote blog article: A Better Do File

  • Added the /journal blog to vinnie.work. This is intended to be more of a local commit message for daily activity. In other words, I intend this space to be for activity that hasn't really been thought out or considered in any particular context. (Likely duplicates /stream, which I've never actually used.)