Why Your Million-Dollar GPUs Are Sleeping on the Job
Building a GPU cluster is an expensive undertaking. This is true whether you are an enterprise building on-prem infrastructure to run your AI workloads, a neocloud offering GPU-as-a-Service, or a hyperscaler.
Because this is such a large investment, you need to ensure a reasonable ROI. One of the main parameters affecting this ROI is cost per million tokens, or CPMT. The number of tokens the infrastructure can process usually correlates with the revenue (or benefit) the infrastructure generates, while the cost of those tokens is a mixture of CAPEX and OPEX, derived from the size of the infrastructure but also from its efficiency, or utilization.
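To make the relationship between utilization and CPMT concrete, here is a small illustrative calculation. All the figures (hourly cost, token throughput) are hypothetical, chosen only to show how idle capacity flows straight into cost per token:

```python
# Illustrative cost-per-million-tokens (CPMT) calculation.
# All numbers below are hypothetical examples, not real cluster figures.

def cpmt(hourly_cost_usd: float, tokens_per_sec_peak: float, utilization: float) -> float:
    """Cost per million tokens for a cluster running at a given utilization."""
    tokens_per_hour = tokens_per_sec_peak * utilization * 3600
    return hourly_cost_usd / (tokens_per_hour / 1e6)

# Example: a cluster costing $400/hour (amortized CAPEX + OPEX),
# capable of 500k tokens/sec at full utilization.
print(cpmt(400.0, 500_000, 1.0))   # ~ $0.22 per million tokens, fully utilized
print(cpmt(400.0, 500_000, 0.5))   # ~ $0.44 -- 50% idle doubles the CPMT
```

The hourly cost is fixed regardless of output, so halving utilization exactly doubles CPMT, which is the dynamic the rest of this article is about.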
The Gap
There is a gap, however, between the theoretical performance of the cluster and the actual performance. The theoretical performance is simply the product of a single processor's (e.g., GPU's) performance (in terms of FLOPS) and the number of GPUs in the cluster. The actual performance introduces another factor: the fraction of GPU cycles spent idle during the job run.
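That relationship can be written as a one-liner. The GPU count and per-GPU FLOPS below are hypothetical round numbers, used only to show how the idle fraction scales down the theoretical peak:

```python
# Theoretical vs. actual cluster performance, per the definitions above.
# 1e15 FLOPS per GPU and 1,024 GPUs are hypothetical example figures.

def effective_flops(per_gpu_flops: float, n_gpus: int, idle_ratio: float) -> float:
    """Actual throughput = theoretical peak scaled by the non-idle fraction."""
    theoretical = per_gpu_flops * n_gpus
    return theoretical * (1.0 - idle_ratio)

peak = effective_flops(1e15, 1024, 0.0)   # theoretical: 1.024e18 FLOPS
real = effective_flops(1e15, 1024, 0.5)   # actual, with half the cycles idle
print(peak, real)
```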
Yes, your GPUs are sleeping on the job. The reason derives from the heart of parallel computing. The fact that all the GPUs in the cluster (or at least a large number of them) are working on the same workload and the same dataset (regardless of the parallelism type: data, pipeline, tensor, and/or expert) means that a significant element of the process involves data synchronization between those GPUs. This calls for collective operations (or collective communications), which are handled by libraries like NVIDIA's NCCL (NVIDIA Collective Communications Library) that utilize the network between the GPUs (i.e., the backend network) to distribute data across GPUs.
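The workhorse collective in training is all-reduce, commonly implemented over a ring. The sketch below simulates the ring all-reduce data movement in a single Python process, with plain lists standing in for GPU buffers; it illustrates the communication pattern, not NCCL's actual implementation:

```python
# Single-process sketch of ring all-reduce, the communication pattern
# libraries like NCCL use to synchronize data across GPUs. Each "rank"
# is a plain Python list; every rank ends with the elementwise sum.

def ring_allreduce(ranks: list[list[float]]) -> list[list[float]]:
    n = len(ranks)
    data = [list(r) for r in ranks]
    assert all(len(r) == n for r in data), "one chunk per rank, for simplicity"

    # Phase 1: reduce-scatter. At each step, rank i passes a partial sum
    # of one chunk to its right-hand neighbor, which adds it to its own.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n                       # chunk rank i forwards
            data[(i + 1) % n][c] += data[i][c]    # "send" + reduce

    # After n-1 steps, rank i holds the complete sum for chunk (i+1) % n.

    # Phase 2: all-gather. Each rank forwards a fully-reduced chunk to its
    # neighbor, which simply copies it, until every rank has every chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = data[i][c]
    return data

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every rank ends with [12, 15, 18]
```

Note that every element crosses the ring links 2(n-1) times in total, which is why the backend network sits directly on the critical path of each training step.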
The Network Bottleneck
Here lies the problem. Networks have not evolved at the pace of compute. While the amount of data compute elements such as GPUs can process has increased significantly, the networking infrastructure, at all levels, has failed to keep pace, creating a bottleneck in which GPUs wait idle for data to travel through the backend network and feed their next step of computation. In fact, in some cases hyperscalers have found that this time spent on networking can reach 50% or more of total GPU cycles. This means twice the CPMT of the theoretical performance, and twice the time to ROI.
This backend networking infrastructure is, in fact, a mixture of three types of networking use cases: scale-up, scale-out, and scale-across, as illustrated in the following diagram:

- Scale-up: Protocols originally used inside a computer or server to connect compute and storage elements are now used to connect those resources across an entire rack. Up to 72 GPUs (in the current generation) can be connected with a high-capacity, low-latency protocol like NVLink.
- Scale-out: A much larger networking infrastructure, connecting GPUs across the entire datacenter.
- Scale-across: Connectivity across multiple datacenters over which a single GPU cluster is spread. This is usually the case when the origin datacenter has run out of power capacity and further growth is needed.
This networking bottleneck is relevant to all of these use cases, but it is most noticeable in scale-out scenarios, in which classic datacenter connectivity protocols like Ethernet are not suited to the performance requirements of backend connectivity.
The Ethernet Evolution
The reason the "classic" Ethernet protocol could not handle backend connectivity for large AI clusters is that, at its base, Ethernet is a lossy protocol. When the utilization of Ethernet links gets higher, connectivity performance drops sharply, and parameters like packet loss, jitter (delay variance), and tail latency (the time it takes the slowest packet in a session to cross the network) spiral out of control. This results, as mentioned, in idle GPU cycles and, in severe cases, in a complete restart of the workload, both of which can lead to 30%-50% degradation in workload performance (often measured in job completion time, or JCT).
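Tail latency matters so much because a collective step finishes only when the slowest message arrives: every GPU waits at the barrier for the worst-case flow, not the average one. The toy simulation below makes that visible; the flow count, base latency, loss rate, and retransmit penalty are all hypothetical figures chosen for illustration:

```python
# Why tail latency dominates collectives: the step completes at the MAX
# over all flows, not the mean. All numbers here are hypothetical.
import random

random.seed(0)

def collective_step_time(n_flows: int, base_us: float, loss_rate: float,
                         retransmit_penalty_us: float) -> float:
    """Completion time (us) of one synchronization step across n_flows messages."""
    times = []
    for _ in range(n_flows):
        t = base_us
        if random.random() < loss_rate:   # a dropped packet forces a slow retransmit
            t += retransmit_penalty_us
        times.append(t)
    return max(times)                     # barrier: everyone waits for the slowest

# 1,024 flows, 10 us base latency, 1% loss, 1 ms retransmit penalty:
lossy = collective_step_time(1024, 10.0, 0.01, 1000.0)
lossless = collective_step_time(1024, 10.0, 0.0, 1000.0)
print(lossy, lossless)
```

Even at 1% loss, with a thousand flows it is near-certain that at least one packet is dropped, so the whole step runs at retransmit speed while the GPUs sit idle.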
There are non-Ethernet solutions for scale-up and scale-out networking that perform better. The industry called for an Ethernet-based solution that would close this performance gap and enable infrastructure owners to fully utilize their investment in GPUs and shorten ROI time, reducing CPMT.
The solution was the evolution of Ethernet itself. The main quality missing from the original Ethernet architecture was scheduling. Adding a mechanism that controls the way Ethernet frames cross the backend fabric and ensures better load balancing across fabric links results in lower (to zero) packet loss, very low (to zero) jitter, and significantly lower tail latency. These are key parameters for reducing job completion time and significantly cutting the number of idle GPU cycles.
Two methods were introduced to add a scheduling mechanism to Ethernet. One is fabric scheduling, in which the scheduling is done within the fabric itself: traffic between the top-of-rack and spine switches is carried over a cell-based fabric, which allows near-perfect load balancing and all of the derived performance benefits. The second approach is endpoint scheduling, in which the scheduling (or packet spraying) is done by the network endpoints, i.e., the network interface cards (NICs) within the servers, which send packets into the network in a manner that avoids or bypasses congestion points within the fabric.
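The difference between classic per-flow load balancing and the packet spraying described above can be sketched in a few lines. The link count, flow sizes, and hash values below are contrived examples: per-flow hashing can pile several large flows onto one link, while spraying spreads every flow across all links, so the most-loaded link (which gates completion) carries far less:

```python
# Contrast between classic per-flow ECMP-style hashing and per-packet
# spraying by endpoint-scheduled NICs. Four equal-cost links and eight
# equal flows are hypothetical illustration values.
NUM_LINKS = 4
flows = [100] * 8                     # eight flows of 100 traffic units each

def max_link_load_hashing(flows, hashes):
    """Per-flow hashing: each flow sticks to one hash-chosen link."""
    loads = [0.0] * NUM_LINKS
    for size, h in zip(flows, hashes):
        loads[h % NUM_LINKS] += size
    return max(loads)

def max_link_load_spraying(flows):
    """Per-packet spraying: every flow's packets spread over all links."""
    loads = [0.0] * NUM_LINKS
    for size in flows:
        for link in range(NUM_LINKS):
            loads[link] += size / NUM_LINKS
    return max(loads)

unlucky_hashes = [0, 0, 0, 1, 1, 2, 2, 3]   # hash collisions happen
print(max_link_load_hashing(flows, unlucky_hashes))  # 300.0: three flows share link 0
print(max_link_load_spraying(flows))                 # 200.0: perfectly balanced
```

Since the job waits on the busiest link, the hashed case finishes 50% later than the sprayed case here, despite identical total traffic and identical hardware.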
A Happy Ending
This evolution has resulted in a major uptick in GPU cluster performance. It has also changed the way GPU clusters are designed. While in the past the main focus of this design was on compute resources, today, an equal weight in the design is given to networking aspects, acknowledging the importance of this element on the overall business case of a cluster.
About the Author
Dudy Cohen serves as the VP of AI solutions at DriveNets. He has over 30 years of experience in the networking infrastructure world, accumulated from his service in various vendors and service providers, including Ceragon and Alvarion. He holds a BSc.EE. and an MSc.EE. from Tel Aviv University.