Keysight IxNetwork VE achieves 400 Gbps throughput with DPDK Performance Acceleration
The race for higher speeds
The amount of traffic exchanged inside data centers around the world is increasing at a rapid pace and networking technologies must continue to evolve in order to meet the demand for more bandwidth. Today it is common to have 25 Gbps NICs connecting the servers to the network and 100 Gbps / 400 Gbps links connecting different switches. The next generation of hardware is already available, with 100 Gbps (and above) NICs becoming more common and 800 Gbps (and soon 1.6 Tbps) switch ports starting to appear in the core network.
Keysight continues to lead the way in Ethernet innovation. The AresONE hardware traffic generators together with the IxNetwork software already support the testing of 800 Gbps networks. Hardware traffic generators have multiple advantages such as guaranteed line rate at every frame size (including small frames such as 64 bytes) and sub-nanosecond-level latency measurement accuracy.
But there are scenarios where a hardware traffic generator cannot be used for various reasons. An alternative is a software traffic generator such as Keysight IxNetwork VE, which offers similar features and can be deployed as a set of virtual machines running on x86-64-based general-purpose hardware (also known as Commercial Off-The-Shelf / COTS servers) augmented by virtualization software. You can install this solution in a multitude of private and public cloud environments and successfully complement the testing performed with hardware traffic generators.
Software traffic generators have historically been used in scenarios where low-scale / low-performance testing was sufficient. One main roadblock to wider adoption was limited data plane performance, caused by several factors (such as the software traffic generator's ability to craft packets fast enough and the virtualized environment's ability to forward all of those packets to their destinations). These factors often capped data plane performance at roughly 2 Mpps per vCPU core (which translates to around 1 Gbps at 64 bytes and 25 Gbps at 1518 bytes), which is more than enough for functional testing but not sufficient for high-scale / high-performance testing.
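The arithmetic behind these pps-to-Gbps conversions can be sketched as follows (the helper names are illustrative; the 20-byte figure is the standard Ethernet wire overhead of preamble, start-of-frame delimiter, and inter-frame gap):

```python
# Convert between packet rate and line-rate (L1) throughput.
# Each frame occupies an extra 20 bytes on the wire: 7-byte preamble,
# 1-byte start-of-frame delimiter, and 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20

def line_rate_bps(frame_size_bytes: int, pps: float) -> float:
    """L1 throughput produced by `pps` frames per second of a given size."""
    return (frame_size_bytes + WIRE_OVERHEAD_BYTES) * 8 * pps

def pps_for_rate(frame_size_bytes: int, bps: float) -> float:
    """Packet rate needed to fill `bps` of line rate at a given frame size."""
    return bps / ((frame_size_bytes + WIRE_OVERHEAD_BYTES) * 8)

# 2 Mpps of 1518-byte frames is roughly 25 Gbps on the wire:
print(f"{line_rate_bps(1518, 2e6) / 1e9:.1f} Gbps")  # ~24.6 Gbps
```

The same helpers reproduce the per-frame-size figures quoted throughout the test results below.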
Multiple solutions are used by the industry to address such performance bottlenecks. The most common are the Data Plane Development Kit (DPDK) and Single Root I/O Virtualization (SR-IOV). DPDK is a set of software libraries that accelerate packet processing for network-intensive workloads. SR-IOV is a method to bypass the hypervisor's virtual software switch (vSwitch) by connecting the virtual machine directly to the physical NIC (pNIC). SR-IOV is an evolution of PCI Pass-Through (PCI-PT): it allows you to partition the pNIC into multiple Virtual Functions (VFs) that different virtual machines can use simultaneously (as opposed to PCI-PT, where one single virtual machine controls the entire pNIC).
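On a Linux host, partitioning a pNIC into VFs is typically done through the kernel's standard sysfs interface. A minimal sketch (the interface name `ens1f0` and the helper names are hypothetical; writing the knob requires root privileges and a NIC driver with SR-IOV support):

```python
from pathlib import Path

def sriov_vf_path(pf: str) -> Path:
    """Sysfs knob controlling the number of VFs on a physical function."""
    return Path(f"/sys/class/net/{pf}/device/sriov_numvfs")

def create_vfs(pf: str, count: int) -> None:
    """Partition the pNIC into `count` Virtual Functions (requires root)."""
    sriov_vf_path(pf).write_text(str(count))

# Example (hypothetical interface name, run as root):
# create_vfs("ens1f0", 4)
```

Each VF then appears as an independent PCI device that the hypervisor can hand to a different virtual machine.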
IxNetwork VE leverages DPDK to accelerate its packet crafting engine and generate traffic much faster, while also leveraging SR-IOV to bypass virtualized infrastructure bottlenecks and forward all the traffic out of the virtual environment. Thanks to these two technologies, data plane performance increases by a factor of 10 and now exceeds 20 Mpps per vCPU core. A single server can now reach 400 Gbps when using a modern software traffic generator like IxNetwork VE.
The test topology
To demonstrate this performance, we selected one single-socket COTS server. The most relevant hardware components in this exercise include 1x Intel Xeon Platinum 8370C CPU (with 32 cores / 64 threads / 2.80 GHz base speed / hyperthreading enabled) and 2x NVIDIA ConnectX-6 200G NICs (with PCI Express 4.0 / SR-IOV enabled / MCX653105A-HDAT model). The virtualization software (hypervisor) powering the server is VMware vSphere ESXi 8.0.0 (build 20513097).
It is important to note that the PCI Express 4.0 x16 interface (which connects the CPU and the NIC inside the server) starts to become a bottleneck at such high speeds, so two separate NICs are connected to two separate PCI Express interfaces in order to increase the total bandwidth available inside the server. A new generation of hardware with support for PCI Express 5.0 (which will provide double the CPU-to-NIC bandwidth compared to PCI Express 4.0) is expected to hit the market soon.
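The numbers behind this bottleneck are easy to sketch: PCI Express 4.0 signals at 16 GT/s per lane with 128b/130b encoding, so an x16 slot tops out at roughly 252 Gbps per direction before transaction-layer (TLP) overhead, leaving little headroom above a 200G NIC:

```python
# Approximate per-direction bandwidth of a PCI Express 4.0 x16 slot.
LANES = 16
GTS_PER_LANE = 16        # PCIe 4.0 signaling rate, in GT/s
ENCODING = 128 / 130     # 128b/130b line encoding efficiency

raw_gbps = LANES * GTS_PER_LANE * ENCODING
print(f"~{raw_gbps:.0f} Gbps per direction before TLP overhead")
# PCIe 5.0 doubles the signaling rate to 32 GT/s, doubling this figure.
```

This is why two 200G NICs in two separate x16 slots are needed to expose the full 400 Gbps inside the server.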
The software traffic generator includes 1x Virtual Chassis (with 2 vCPU / 4 GB RAM) which is used for management functions and up to 8x Virtual Load Modules (with 4 vCPU / 4 GB RAM each) which are used for test functions. Each of the Virtual Load Modules has one single vNIC which is mapped to one SR-IOV VF from the pNIC resulting in up to 8x virtual test ports being used in one single test (each of the 2x 200G pNIC aggregates traffic from up to 4x test ports).
The 4 vCPU allocated to each Virtual Load Module are divided into 1 vCPU for TX functions, 1 vCPU for RX functions, and 2 vCPU for auxiliary functions (such as emulating the control plane protocols and processing the statistics), as described on page #5 of the IxNetwork VE Data Sheet. The 2 vCPU allocated to TX + RX run at 100% utilization, while the 2 vCPU allocated to auxiliary functions are mostly idle.
This means the most intensive test executed stresses only half of the server's CPU cores for TX + RX functions (16 cores out of 32), with a quarter of the server resources (8 cores out of 32) used for TX functions and another quarter (8 cores out of 32) used for RX functions. It is important to keep these aspects in mind when analyzing the test results presented in the following section, as they show the potential to scale to even higher speeds by further optimizing the test topology.
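A quick tally of the vCPU budget for the largest test topology (8 Virtual Load Modules) confirms these proportions:

```python
# vCPU accounting for the largest test topology (8 Virtual Load Modules).
MODULES = 8
SERVER_CORES = 32

tx_cores = MODULES * 1    # 1 TX vCPU per Virtual Load Module
rx_cores = MODULES * 1    # 1 RX vCPU per Virtual Load Module
aux_cores = MODULES * 2   # control plane + statistics, mostly idle

print(f"TX+RX: {tx_cores + rx_cores} of {SERVER_CORES} cores "
      f"({(tx_cores + rx_cores) / SERVER_CORES:.0%})")
```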
The test results – Scenario #1 (single flow)
The first test scenario consists of one single bidirectional flow running between one single pair of Virtual Load Modules. This topology uses a total of 4 vCPU for traffic TX+RX functions which is equivalent to 12.5% of the server resources.
The two test ports are added inside the IxNetwork application and configured with a 200 Gbps line speed each for a total capacity of 2x 200 Gbps = 400 Gbps. It is important to note that the DPDK Performance Acceleration feature of IxNetwork VE is enabled, as can be seen in the GUI.
Running traffic with a 9000-byte frame size requires only 5.5 Mpps to saturate the links and achieve 400 Gbps bidirectional throughput. This is significantly below the capacity of the two Virtual Load Modules, which can exceed a total of 40 Mpps at lower frame sizes.
When the frame size is decreased to 512 bytes, the links are no longer saturated with one single flow. In this case the software traffic generator still delivers a very respectable 160 Gbps of bidirectional throughput, equivalent to roughly 37.5 Mpps.
Running one more test with the small frame size of 64 bytes approaches 27 Gbps bidirectional throughput with one single flow (equivalent to 40 Mpps). This is nowhere near saturating the 400 Gbps links (a hardware traffic generator is necessary for very small frame sizes) but it is a major improvement over the low-single-digit Gbps bidirectional throughput achieved without DPDK and SR-IOV.
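These three data points are consistent with the line-rate arithmetic, counting the 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap); the helper name is illustrative:

```python
WIRE_OVERHEAD_BYTES = 20  # preamble + SFD + inter-frame gap, in bytes

def mpps(frame_size_bytes: int, gbps: float) -> float:
    """Packet rate (in Mpps) carried by `gbps` of line rate."""
    return gbps * 1e9 / ((frame_size_bytes + WIRE_OVERHEAD_BYTES) * 8) / 1e6

print(f"9000B @ 400 Gbps: {mpps(9000, 400):.1f} Mpps")  # ~5.5
print(f" 512B @ 160 Gbps: {mpps(512, 160):.1f} Mpps")   # ~37.6
print(f"  64B @  27 Gbps: {mpps(64, 27):.1f} Mpps")     # ~40.2
```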
The test results – Scenario #2 (multiple flows / horizontal scale)
The second test scenario explores the possibility of reaching even higher performance by using multiple bidirectional flows in parallel. These flows are distributed across multiple Virtual Load Modules connected to different Virtual Functions of the same pNIC, so the traffic is aggregated onto the same physical links as in the previous scenario. This type of scaling (replicating one basic building block multiple times) is also known as horizontal scaling. This topology uses a total of 16 vCPU for traffic TX+RX functions, which is equivalent to 50.0% of the server resources.
The eight test ports are added inside the IxNetwork application and configured with a 50 Gbps line speed each for a total capacity of 8x 50 Gbps = 400 Gbps. It is once again important to note that the DPDK Performance Acceleration feature of IxNetwork VE is enabled, as can be seen in the GUI.
The test with jumbo frames is now skipped (since the link capacity was already saturated with one single flow) and the first test uses a 512-byte frame size. In this case, the links are again saturated and 400 Gbps throughput is achieved (equivalent to almost 94 Mpps).
The last test attempts to reach maximum performance with the small frame size of 64 bytes which is now limited to around 75 Gbps bidirectional throughput (or about 112 Mpps). We observe once again the 400 Gbps links are not saturated, but a reasonable amount of the link bandwidth is filled with small frames sent by the software traffic generator.
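The same arithmetic confirms the aggregate figures for this scenario (helper name illustrative; 20 bytes of wire overhead per frame):

```python
WIRE_OVERHEAD_BYTES = 20  # preamble + SFD + inter-frame gap, in bytes

def mpps(frame_size_bytes: int, gbps: float) -> float:
    """Packet rate (in Mpps) carried by `gbps` of line rate."""
    return gbps * 1e9 / ((frame_size_bytes + WIRE_OVERHEAD_BYTES) * 8) / 1e6

print(f"512B @ 400 Gbps: {mpps(512, 400):.1f} Mpps")  # ~94.0
print(f" 64B @  75 Gbps: {mpps(64, 75):.1f} Mpps")    # ~111.6
# Spread across the 8 test ports, ~112 Mpps is roughly 14 Mpps per port.
```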
The days when software traffic generators struggled to exceed a few Gbps of throughput are now behind us. Modern (yet well-established) technologies such as DPDK and SR-IOV enable Keysight IxNetwork VE to exceed 20 Mpps per vCPU core and saturate 2x 200 Gbps bidirectional links at frame sizes starting with 512 bytes. Another beneficial side effect of increasing throughput is the capability to measure latency with better accuracy (which we will explore in a future post).
While hardware traffic generators are getting closer to exceeding the 1 Tbps per port barrier, software traffic generators are now stepping into 400 Gbps territory at medium frame sizes and are quickly approaching 100 Gbps at minimum frame sizes.