Optimize AI Network Performance and Efficiency

Accelerate AI data center deployments, validate SmartNIC performance, and pressure test networking components. Use real-world traffic emulators to track an array of industry-standard AI metrics — such as job completion time and collective communication bandwidth — in real time. Benchmark AI network performance, detect bottlenecks, and optimize AI workload distribution with AI-optimized network test tools, including AI workload emulators, distributed network traffic generators, and network traffic emulators.

Validate lossless Ethernet at speeds as high as 1.6T

Stay ahead of accelerating performance demands by ensuring reliable data transmission in AI/ML and high-performance computing networks.

Pressure-test AI network equipment against AI workload emulations

Reduce the need for costly GPU-based lab setups with high-density traffic generators that emulate AI workload behavior to optimize performance and efficiency.

See how AI-specific network parameters impact performance

Choose from an array of traffic models and workload profiles to simplify benchmarking and test network performance at the component and system level.

Executive Perspective: Keysight AI Solutions

Listen to Ram Periakaruppan, Vice President and General Manager of the Network Applications and Security business at Keysight Technologies, discuss key challenges facing AI data centers, how to optimize AI performance and efficiency, and how Keysight is helping with its portfolio of AI-ready data center solutions.

Frequently Asked Questions: AI Networks

In a traditional network, workload type and size vary, traffic is distributed across many connections and grows in proportion to the number of users, and delayed or dropped packets typically do not cause significant problems. In an AI network, the GPUs all work on the same problem, such as training a large language model (LLM). These workloads require massive amounts of data to be shared between GPUs without dropped packets or congestion. Because the GPUs are all working on the same problem, a task completes only when the last GPU finishes processing. Any delay in delivering data to one GPU delays the entire workload.
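The "last GPU gates the job" behavior above can be sketched in a few lines. This is an illustrative example with hypothetical per-GPU step times, not a Keysight tool: in a synchronized AI workload, job completion time is the maximum over all GPUs, so one straggler delays everyone.

```python
# Hypothetical per-GPU step times (ms); one GPU is delayed by congestion.
gpu_times_ms = [102.0, 99.5, 101.2, 250.0]

# In a traditional, loosely coupled workload, average time is what matters.
avg_time = sum(gpu_times_ms) / len(gpu_times_ms)  # ~138 ms

# In a synchronized AI workload, the job finishes only when the LAST GPU does.
job_completion_time = max(gpu_times_ms)  # 250 ms: the straggler gates the job

print(f"Average GPU step time:      {avg_time:.1f} ms")
print(f"Job completion time (JCT):  {job_completion_time:.1f} ms")
```

Even though three of the four GPUs finish in about 100 ms, the job takes 250 ms, which is why AI networks are so sensitive to delayed or dropped packets.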

Optimizing an AI network is different from optimizing a traditional data center network. AI networks run at near capacity and need to be lossless to maximize GPU utilization. Different congestion mechanisms are available with various settings. Running AI workloads in a lab setting with benchmarking tools provides a path to finding the optimal configurations and settings that can then be applied to production environments.

In an AI network, GPUs work on the same problem, only completing a task when the last GPU receives the data it needs and finishes processing. One of the key measurements of an AI network's performance is tail latency: the completion times of the slowest flows. A common metric is P95, the time within which 95 percent of network flows complete; the slowest five percent of flows lie beyond it and define the tail.
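The P95 metric can be sketched with a nearest-rank percentile calculation. This is an illustrative example using hypothetical flow completion times (FCTs), not a Keysight measurement tool:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical flow completion times in ms; two straggler flows form the tail.
fcts_ms = [10, 11, 12, 12, 13, 13, 14, 15, 16, 15,
           11, 12, 14, 13, 12, 13, 14, 15, 40, 95]

p95 = percentile(fcts_ms, 95)
print(f"P95 flow completion time: {p95} ms")  # 40 ms here
```

Note how the median flow finishes in about 13 ms, but P95 is 40 ms: the tail, not the average, determines when the last GPU gets its data.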

RDMA stands for Remote Direct Memory Access. RDMA allows GPUs in an AI data center to transfer data directly between each other with minimal involvement from the CPU and networking stack, enabling low-latency, high-throughput communication. RDMA-enabled network interface cards in each server connect to RDMA-capable switches to provide high-speed communication between GPUs.

Ultra Ethernet (UE) adds capabilities to Ethernet to provide a fast, highly scalable, low-latency network for AI and high-performance computing requirements. Packet spraying allows flows to use more than one path to a destination, improving load balancing across the network. Flexible ordering allows packets to arrive at their destination out of order. Receiver-based congestion control builds on existing sender-based congestion control mechanisms to mitigate the incast congestion that occurs with AI collectives such as All-to-All. Improved telemetry allows faster control-plane signaling, improving response to congestion events. UE is interoperable with existing data center Ethernet switches but runs more efficiently, with higher network utilization and reduced tail latency, when using UEC-based switches and network interface cards.

The movement of data among GPUs is called a collective operation. There are several types, depending on the initial and final locations of the data and whether a mathematical operation must be performed on the data along the way. Commonly used types are Broadcast, Gather, ReduceScatter, AllGather, AllReduce, and AllToAll. The presence of "Reduce" in an operation's name signifies that the operation performs computations on the data. A collective operation can be implemented using any of several algorithms. Well-known algorithms for AllReduce are Unidirectional and Bidirectional Ring, Double Binary Tree, and Halving-Doubling. Each performs better or worse depending on the number of GPUs and how they are interconnected.
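To make the AllReduce example concrete, here is a toy simulation of the unidirectional ring algorithm mentioned above. This is an illustrative sketch (not NCCL or any production library): N simulated "GPUs" each hold a vector split into N chunks, and after a reduce-scatter phase plus an all-gather phase, every GPU holds the element-wise sum.

```python
def ring_allreduce(buffers):
    """Toy unidirectional ring AllReduce over len(buffers) simulated GPUs."""
    n = len(buffers)
    data = [list(b) for b in buffers]   # per-GPU working copies
    size = len(data[0])
    assert size % n == 0, "toy version: vector length must divide by GPU count"
    k = size // n                       # chunk length

    def chunk(i, c):
        return data[i][c * k:(c + 1) * k]

    # Phase 1: reduce-scatter. At step t, GPU i forwards chunk (i - t) mod n
    # to its ring neighbor, which adds it in. After n-1 steps, GPU i holds
    # the fully summed copy of chunk (i + 1) mod n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunk(i, (i - step) % n)) for i in range(n)]
        for i, c, payload in sends:     # snapshot sends, then apply
            dst = (i + 1) % n
            for j in range(k):
                data[dst][c * k + j] += payload[j]

    # Phase 2: all-gather. Each GPU circulates its fully reduced chunk around
    # the ring so every GPU ends with the complete summed vector.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunk(i, (i + 1 - step) % n))
                 for i in range(n)]
        for i, c, payload in sends:
            dst = (i + 1) % n
            data[dst][c * k:(c + 1) * k] = payload
    return data

result = ring_allreduce([[1, 2], [3, 4]])
print(result)  # both GPUs end with the element-wise sum [4, 6]
```

Note the communication pattern: each GPU only ever talks to its ring neighbor, and each step moves one chunk, which is why ring algorithms use bandwidth efficiently but accumulate latency as the GPU count grows.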

Want help or have questions?