Putting the HGX A100 8-GPU server platform together
With the GPU baseboard as the building block, NVIDIA server-system partners customize the rest of the server platform for their specific business needs: CPU subsystem, networking, storage, power, form factor, and node management. To deliver the highest performance, we recommend the following system design considerations:
Select two of the highest-end server CPUs to pair with the eight A100 GPUs, so that the CPU subsystem can keep the GPUs fed with work.
Provide plenty of PCIe connectivity: use a minimum of four PCIe x16 links between the two CPUs and the eight A100 GPUs, so the CPUs have enough bandwidth to push commands and data to the GPUs. A simple link-verification sketch is shown after this list.
For the best AI training performance at scale (many nodes running a single training job together), inter-node networking performance is critical. Use a ratio of up to one NIC per A100 GPU; the Mellanox ConnectX-6 200-Gb/s NIC is the best option, giving each GPU up to 25 GB/s of dedicated inter-node bandwidth.
Attach the NICs and NVMe storage to a PCIe switch placed close to the A100 GPUs, and use a shallow, balanced PCIe tree topology. The PCIe switch enables the fastest peer-to-peer transfers between the NICs, NVMe drives, and the A100 GPUs.
Adopt GPUDirect Storage, which reduces read/write latency, lowers CPU overhead, and enables higher performance; a minimal cuFile read sketch follows at the end of this list.
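As a quick sanity check against the PCIe sizing recommendation above, the sketch below uses NVML to confirm that each GPU in the node can negotiate a full PCIe Gen4 x16 link to its upstream switch or CPU. The Gen4/x16 expectation and the build command are assumptions for an HGX A100-class system; treat this as a minimal sketch rather than a complete validation tool.

```c
/*
 * Minimal sketch (assumptions: HGX A100-class node, NVML available).
 * Build: gcc pcie_check.c -lnvidia-ml -o pcie_check
 */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int count, gen, width;

    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);

        /* Maximum link generation and width possible with this GPU and system. */
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &gen);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &width);

        printf("GPU %u: PCIe Gen%u x%u %s\n", i, gen, width,
               (gen >= 4 && width >= 16) ? "OK" : "-- check topology");
    }

    nvmlShutdown();
    return 0;
}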
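For the GPUDirect Storage recommendation, the following is a minimal sketch of a cuFile read that DMAs a file on NVMe directly into GPU memory, bypassing a CPU bounce buffer. The file path, transfer size, and build command are placeholder assumptions, and error handling is trimmed to keep the example short; production code would check every return value.

```c
/*
 * Minimal GPUDirect Storage (cuFile) read sketch.
 * Assumptions: CUDA toolkit and GDS installed; placeholder path and size.
 * Build: nvcc gds_read.c -lcufile -o gds_read
 */
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void)
{
    const char  *path = "/mnt/nvme/shard.bin";   /* placeholder path */
    const size_t size = 256UL << 20;             /* placeholder size: 256 MiB */

    /* GDS requires O_DIRECT so the filesystem does not buffer the I/O. */
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    cuFileDriverOpen();

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type      = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    void *devPtr = NULL;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);   /* pin the GPU buffer for DMA */

    /* Read 'size' bytes from file offset 0 straight into GPU memory. */
    ssize_t n = cuFileRead(handle, devPtr, size, 0, 0);
    printf("cuFileRead returned %zd bytes\n", n);

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    cuFileDriverClose();
    cudaFree(devPtr);
    close(fd);
    return 0;
}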