Introduction to AI/ML Data Center Networks

This bootcamp explores the fundamental differences between machine learning (ML) training and inference within data center environments, covering both the concepts and their practical implications. ML training, often referred to as the "back end" of artificial intelligence (AI), is the stage where models are built. It requires ingesting massive amounts of data and performing intensive matrix computations, for which GPUs (graphics processing units) are particularly well suited. Training at this scale demands substantial computing power and time, often spanning days, weeks, or even months. Once training completes, the model can recognize specific categories of data, for example identifying a tomato and its various attributes.

Once a model is trained, it moves to the inference phase, or "front end," where it becomes accessible to users. Inference produces real-time responses to queries based on the trained model, such as answering a user question like "What are the best tomatoes to use for pizza sauce?" This phase depends heavily on the model's ability to generalize from its training and to operate efficiently at scale, especially on web-accessible platforms.

The session also delves into the unprecedented scale and complexity of modern AI data centers (AI DCs). These facilities require staggering investment, both in infrastructure and in operating costs. Some of the largest data centers are projected to deploy hundreds of thousands of GPUs and consume upwards of 9 gigawatts of power by 2030, comparable to the output of the world's largest nuclear power plants. Scale of this kind brings significant capital expenditure and operational challenges, which makes design, efficiency, and reliability central concerns in AI infrastructure.

Two critical technical challenges are discussed in detail. First is the issue of GPUs waiting on data, where inter-GPU communication and synchronization can become bottlenecked by network latency. The focus isn’t just on average performance but on tail latency, particularly the 95th percentile (P95). This is because the completion of a workload across a GPU cluster is gated by the slowest participant; the next training cycle cannot begin until all GPUs finish and synchronize. Thus, optimizing network infrastructure to minimize P95 latency is crucial for ML efficiency.
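
The gating effect described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of any real cluster: each GPU's communication time per step is drawn independently as either a fast or a slow value, and all names and numbers (simulate, straggler_probability, 64 GPUs, a 6% chance of a slow transfer, 1 ms and 10 ms) are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def step_time(per_gpu_times):
    """A synchronized step ends only when the slowest GPU has finished."""
    return max(per_gpu_times)


def straggler_probability(num_gpus, slow_prob):
    """Chance that at least one of num_gpus independent GPUs is slow in a step."""
    return 1.0 - (1.0 - slow_prob) ** num_gpus


def simulate(num_gpus=64, num_steps=200, slow_prob=0.06,
             fast_ms=1.0, slow_ms=10.0, seed=0):
    """Toy model (assumed): each GPU takes fast_ms, or slow_ms with slow_prob."""
    rng = random.Random(seed)
    samples, steps = [], []
    for _ in range(num_steps):
        times = [slow_ms if rng.random() < slow_prob else fast_ms
                 for _ in range(num_gpus)]
        samples.extend(times)
        steps.append(step_time(times))
    return {
        "mean_per_gpu_ms": statistics.mean(samples),   # what an average hides
        "p95_per_gpu_ms": percentile(samples, 95),     # tail of individual GPUs
        "mean_step_ms": statistics.mean(steps),        # what training actually pays
    }


if __name__ == "__main__":
    print(simulate())
    print(straggler_probability(64, 0.06))
```

In this toy model the mean per-GPU time stays well below the slow value, while the mean step time is typically close to it, because the chance that at least one GPU in a step is slow grows quickly with cluster size. That is why the tail, rather than the average, is the number to optimize.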

Second, the presentation highlights the high failure rates in ML workflows, which can exceed 50% in some deployments. These failures are often caused by faulty hardware components, high bit error rates in interconnects, firmware glitches, software bugs, and network timeouts. Failures this frequent call for robust checkpointing and recovery systems, so that training can restart from a recent saved state rather than lose all progress. Understanding and addressing the weakest links in the system is therefore essential to maintaining throughput and reliability in AI workloads.
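
The checkpoint-and-resume idea can be sketched in a few lines. This is an illustrative example only, assuming a single process whose "training state" is a small dictionary saved as JSON. The function names (save_checkpoint, load_checkpoint, train) and the fail_at_step parameter are hypothetical, and a real training job would save model weights and optimizer state with its framework's own tools.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # make sure the bytes reach the disk
    os.replace(tmp_path, path)     # readers never see a half-written file


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at_step=None):
    """Resume from the last checkpoint and run until total_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at_step is not None and step == fail_at_step:
            raise RuntimeError("simulated hardware failure")
        state["total"] += step     # stand-in for one step of real training work
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.json")
        try:
            train(ckpt, total_steps=10, checkpoint_every=5, fail_at_step=7)
        except RuntimeError as err:
            print("crashed:", err)
        print("resumed:", train(ckpt, total_steps=10, checkpoint_every=5))
```

The checkpoint interval is a trade-off: saving more often costs more I/O during normal operation, while saving less often means more steps are repeated after each failure. In the run above, the work done in steps 5 and 6 is lost in the crash and redone after the restart.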

Lastly, the growing adoption of AI and the increasing load on the inference side bring additional complexity, especially in scaling workloads and addressing emerging security concerns. As more individuals integrate AI into their work and personal lives, inference demand surges, driving the need for responsive, secure, and robust front-end infrastructure.