
Why AI Workloads Demand System-Level Validation

Optimizing Networks for the AI Era

As artificial intelligence (AI) and machine learning (ML) drive innovation across industries, the networks supporting them are undergoing a significant transformation. Unlike traditional data centers, AI workloads demand a network architecture that prioritizes throughput, ultra-low latency, and synchronized performance across thousands of GPUs.

We sat down for a conversation with Allison Freedman, product manager at Keysight Technologies, to explore the fundamental differences between AI/ML and traditional networks, why conventional Ethernet tuning falls short, and how Keysight is helping organizations maximize the performance and efficiency of their AI infrastructure.

Q: What makes AI networks so different from traditional data center networks?

Allison: When it comes to data type, volume, and traffic patterns, AI data centers operate on a completely different scale than traditional networks. Traditional networks are designed to manage a wide range of asynchronous tasks, such as web requests, database queries, and overnight batch jobs, where occasional packet delays or losses don’t significantly impact performance.

However, AI workloads represent a fundamentally different challenge. In AI data centers, thousands of GPUs operate in parallel to train a single model, functioning like a supercomputer. Every GPU must get its data — on time, every time — because the entire job only finishes when the slowest GPU is done. So, it’s not about average performance; it’s about the long tail, the outliers that can delay the entire system.
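To make that tail effect concrete, here is a toy Python sketch. The GPU count, step times, and the single straggler are made-up illustrative values, not measurements; the point is simply that a synchronized collective finishes only when the last GPU does.

```python
# Toy model: one delayed GPU sets the effective step time for the whole cluster.
import random

NUM_GPUS = 1024
MEAN_STEP_MS = 10.0

random.seed(7)
# Most GPUs finish a training step in roughly 10 ms...
step_times = [MEAN_STEP_MS * random.uniform(0.95, 1.05) for _ in range(NUM_GPUS)]
# ...but one hypothetical straggler is delayed ~5x by network congestion.
step_times[42] = MEAN_STEP_MS * 5

avg = sum(step_times) / NUM_GPUS
worst = max(step_times)  # a synchronized collective waits for the slowest GPU

print(f"average GPU step time: {avg:.2f} ms")
print(f"effective step time:   {worst:.2f} ms (set by the slowest GPU)")
```

Even though the average barely moves, every step runs at the straggler's pace, so 1,023 GPUs sit idle waiting for one delayed data delivery.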

Q: How does that impact network design and optimization?

Allison: In traditional data center networks, traffic consists of many independent queries and scheduled jobs, including overnight batch runs. These workloads vary widely, the load is spread across many links and grows roughly in proportion to the number of users, and a delayed or dropped packet rarely causes significant problems.

AI networks run much hotter. Traditional networks typically begin planning upgrades at around 50% link utilization, while AI networks often operate at 90% or higher. Maximizing every bit of bandwidth requires tuning the network at the packet level to ensure consistent, synchronized delivery across all nodes.
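As a back-of-the-envelope illustration of those thresholds, this sketch estimates average link utilization from interface byte counters over a polling window. The link speed and counter values are illustrative placeholders, not readings from any particular device.

```python
# Sketch: average link utilization from byte counters over a polling interval.
LINK_SPEED_BPS = 400e9  # assume a 400GE link

def utilization(bytes_start: int, bytes_end: int, interval_s: float) -> float:
    """Average utilization over the interval, as a fraction of line rate."""
    bits = (bytes_end - bytes_start) * 8
    return bits / (LINK_SPEED_BPS * interval_s)

# e.g. 45 GB transferred in a 1-second polling window on a 400GE link
u = utilization(0, 45_000_000_000, 1.0)
print(f"utilization: {u:.0%}")  # -> 90%: upgrade territory for a traditional
                                # network, routine operation for an AI fabric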

Q: What’s being done to keep AI traffic flowing smoothly to avoid congestion?

Allison: Operators employ an extensive suite of capabilities, starting with priority-based flow control (PFC) and explicit congestion notification (ECN). These are good starting points, but they don't solve the problem completely. Data Center Quantized Congestion Notification (DCQCN) helps coordinate how ECN and PFC respond to congestion, but tuning it is tricky.
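For a feel of why that tuning is tricky, here is a simplified Python sketch of DCQCN's sender-side logic, following the structure of the published algorithm: a multiplicative rate decrease scaled by a congestion estimate, then gradual recovery. The constants are illustrative and real DCQCN includes additional byte-counter and timer-driven increase stages that this sketch omits.

```python
# Simplified sketch of DCQCN sender-side rate control (not a full implementation).

G = 1 / 256  # EWMA gain for the congestion estimate alpha (illustrative value)

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps    # current sending rate (RC)
        self.target = line_rate_gbps  # target rate (RT) used during recovery
        self.alpha = 1.0              # estimate of congestion severity

    def on_cnp(self):
        """Receiver saw ECN-marked packets and sent a congestion notification."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2          # decrease scaled by alpha
        self.alpha = (1 - G) * self.alpha + G    # congestion estimate rises

    def on_quiet_period(self):
        """No CNP for a timer period: decay alpha and recover toward target."""
        self.alpha = (1 - G) * self.alpha
        self.rate = (self.rate + self.target) / 2  # fast-recovery step

s = DcqcnSender(400.0)
s.on_cnp()
print(f"after first CNP: {s.rate:.0f} Gb/s")  # 400 * (1 - 0.5) = 200
for _ in range(3):
    s.on_quiet_period()
print(f"after recovery:  {s.rate:.0f} Gb/s")  # climbs back toward 400
```

Even in this toy form, you can see the tuning problem: the gain G and the timer cadence jointly determine how hard senders back off and how fast they recover, and getting those wrong either wastes bandwidth or re-triggers congestion.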

Beyond that, we’re seeing the rise of packet-level load balancing, like packet spraying and cognitive routing. These technologies split traffic across all available links, which is much more effective than flow-level balancing in an AI environment. Remote Direct Memory Access (RDMA), especially RDMA over Converged Ethernet (RoCE), is another key technology; it enables GPUs to exchange data quickly without CPU involvement.
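The gap between flow-level and packet-level balancing is easy to see in a toy model. The sketch below uses synthetic flow names and simple round-robin as a stand-in for a real spraying implementation, and compares it against ECMP-style flow hashing over four equal-cost links.

```python
# Toy comparison: flow-level ECMP hashing vs. per-packet spraying.
import hashlib
from collections import Counter

LINKS = 4
FLOWS = [f"gpu{i}->gpu{(i + 4) % 8}" for i in range(8)]  # synthetic RDMA flows
PACKETS_PER_FLOW = 1000

def ecmp_link(flow_id: str) -> int:
    """Flow-level ECMP: every packet of a flow hashes to the same link."""
    return hashlib.md5(flow_id.encode()).digest()[0] % LINKS

ecmp_load, spray_load = Counter(), Counter()
seq = 0
for flow in FLOWS:
    for _ in range(PACKETS_PER_FLOW):
        ecmp_load[ecmp_link(flow)] += 1
        spray_load[seq % LINKS] += 1  # per-packet spraying: round-robin
        seq += 1

print("ECMP per-link packets: ", [ecmp_load[i] for i in range(LINKS)])
print("Spray per-link packets:", [spray_load[i] for i in range(LINKS)])
# With a few large flows, hashing often piles several flows onto one link
# while others sit underused; spraying keeps every link evenly loaded.
```

With the small number of large, long-lived flows typical of GPU collectives, hash collisions are common, which is why per-packet techniques win in AI fabrics.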

Q: When should companies consider upgrading their network for AI?

Allison: In traditional data centers, that 50% utilization rule of thumb is the usual trigger. In AI, you might already be above 90% and still be fine-tuning before you upgrade.

It’s not just about faster links — it’s about smarter links. Before scaling out, ensure that your network fabric can manage the unique AI traffic patterns, like bursty all-to-all traffic and incast congestion.
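To illustrate why incast is so punishing, here is a toy model of a synchronized all-to-one burst hitting a single egress port. The sender count, burst size, buffer size, and link rate are illustrative round numbers, not specifications of any particular switch.

```python
# Toy incast model: many synchronized senders burst toward one receiver link.
SENDERS = 32
BURST_BYTES = 256 * 1024        # each sender bursts 256 KiB at once
LINK_RATE_BPS = 400e9 / 8       # 400GE egress to the receiver, in bytes/s
BUFFER_BYTES = 4 * 1024 * 1024  # 4 MiB of buffer on the egress port

arriving = SENDERS * BURST_BYTES  # the all-to-one burst lands near-simultaneously
overflow = max(0, arriving - BUFFER_BYTES)
drain_ms = arriving / LINK_RATE_BPS * 1000

print(f"burst arriving at egress: {arriving / 2**20:.0f} MiB")
print(f"buffer overflow:          {overflow / 2**20:.0f} MiB (dropped or PFC-paused)")
print(f"time to drain the burst:  {drain_ms:.2f} ms")
# 32 senders * 256 KiB = 8 MiB hits a 4 MiB buffer: half the burst must be
# dropped or paused — the incast pattern the fabric has to absorb gracefully.
```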

Q: What business outcomes can you drive by optimizing AI network performance?

Allison: The cost of a stalled AI job isn’t just a few seconds — it’s millions of dollars in underutilized GPUs. Every improvement in network balance and throughput translates directly into faster training times, improved GPU utilization, and faster time to market for your AI models. With the right tools, companies don’t have to guess what will work — they can validate and optimize before deploying, saving millions in potential inefficiencies and reducing costly trial-and-error.

Q: How is Keysight helping customers validate complex AI networks?

Allison: Validation is one of the most important and often overlooked aspects of AI network deployment. That is why we developed the Keysight AI Data Center Builder — a comprehensive solution designed to accelerate the validation and optimization of AI infrastructure.

By emulating real-world AI workloads, including large language models like GPT and Llama, the solution enables AI operators, GPU cloud providers, and infrastructure vendors to assess the performance of new algorithms, components, and protocols within their data centers. The platform supports high-density AI host emulation with 800GE/400GE capabilities, accurately emulating AI cluster behavior without requiring extensive GPU resources.

Listen to the webinar: AI Showcase: Innovations to Scale AI Data Centers.

Read the white paper.

Contact us to learn how to design and test high-performance networks for AI and ML workloads.