xAI has built the world’s largest AI supercomputer, called Colossus, using Nvidia’s Spectrum-X Ethernet fabric instead of InfiniBand. The massive system boasts 100,000 Nvidia Hopper GPUs and can train large language models at incredible speeds.
Colossus was designed to power xAI’s Grok series of large language models, including the chatbot built into Elon Musk’s echo chamber, X. The system packs more than 2.5 times as many GPUs as Frontier, the top-ranked supercomputer in the US.
Impressive as its performance figures are – including 98.9 exaFLOPS of dense FP16/BF16 compute – xAI has already begun adding another 100,000 Hopper GPUs to the cluster, which should roughly double the system’s performance.
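That aggregate figure is consistent with simple arithmetic: roughly 989 teraFLOPS of dense FP16/BF16 per H100 (Nvidia’s published spec for the SXM part, and an assumption here) multiplied across the cluster. A minimal back-of-the-envelope sketch:

```python
# Back-of-the-envelope check of Colossus' aggregate dense compute.
# Assumes ~989.5 TFLOPS dense FP16/BF16 per Hopper H100 SXM GPU (Nvidia's published
# peak figure); sustained training throughput in practice will be lower.
PER_GPU_TFLOPS = 989.5

def aggregate_exaflops(gpu_count: int, per_gpu_tflops: float = PER_GPU_TFLOPS) -> float:
    """Aggregate dense FP16/BF16 compute in exaFLOPS (1 exaFLOPS = 1e6 TFLOPS)."""
    return gpu_count * per_gpu_tflops / 1e6

print(f"100k GPUs: {aggregate_exaflops(100_000):.1f} exaFLOPS")  # ~98.9
print(f"200k GPUs: {aggregate_exaflops(200_000):.1f} exaFLOPS")  # ~197.9, after expansion
```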
Nvidia developed its Spectrum-X Ethernet fabric to overcome Ethernet’s limitations and achieve InfiniBand-like packet loss and latency. The fabric pairs Spectrum SN5600 switches with BlueField-3 SuperNICs in each node, providing a single 400GbE connection per GPU.
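That per-GPU link rate also makes the scale of the fabric easy to bound. The sketch below is not xAI’s actual topology; it assumes one 400GbE connection per GPU, SN5600 switches with 64 ports of 800GbE (51.2 Tbps, per Nvidia’s published specs), and a non-oversubscribed leaf layer with half the ports facing GPUs:

```python
import math

# Rough fabric-sizing sketch under the stated assumptions (not xAI's real topology):
# - one 400GbE BlueField-3 SuperNIC link per GPU
# - SN5600 leaf switches: 64 x 800GbE ports, each splittable into 2 x 400GbE
# - non-oversubscribed leaf layer: half the ports face GPUs, half face the spine
GPUS = 100_000
GBPS_PER_GPU = 400
PORTS_800G = 64

aggregate_pbps = GPUS * GBPS_PER_GPU / 1e6        # total GPU-facing bandwidth, petabits/s
gpu_links_per_leaf = (PORTS_800G // 2) * 2        # 32 downlink ports x 2 x 400GbE each
leaf_switches = math.ceil(GPUS / gpu_links_per_leaf)

print(f"GPU-facing bandwidth: {aggregate_pbps:.0f} Pbps")            # 40 Pbps
print(f"Leaf switches needed (lower bound): {leaf_switches}")        # ~1,563
```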
In tests, xAI claims the system has suffered zero application latency degradation or packet loss from flow collisions, sustaining 95 percent data throughput. Standard Ethernet at this scale, by comparison, would have generated thousands of flow collisions and delivered only 60 percent data throughput.
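Taken at face value, those throughput figures translate directly into usable per-link bandwidth. The illustration below simply assumes the percentages apply to the 400GbE link rate:

```python
# Effective per-GPU bandwidth under the throughput figures cited above
# (illustrative only; assumes the percentages apply directly to the 400GbE link rate).
LINK_GBPS = 400

def effective_gbps(throughput_fraction: float) -> float:
    return LINK_GBPS * throughput_fraction

spectrum_x = effective_gbps(0.95)   # ~380 Gbps usable per GPU
vanilla_eth = effective_gbps(0.60)  # ~240 Gbps usable per GPU
print(f"Spectrum-X: {spectrum_x:.0f} Gbps vs standard Ethernet: {vanilla_eth:.0f} Gbps "
      f"({spectrum_x / vanilla_eth:.2f}x)")
```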
Source: https://www.theregister.com/2024/10/29/xai_colossus_networking/