Chunks and Flows, or Terminology of Collective Operations Benchmarking

When benchmarking collective operations, we’re interested in measuring KPIs using terminology tied to the application level, independent of the underlying transport. We start by establishing the key concepts the application operates with, and then connect those application concepts to the underlying transport implementation to understand the relationships between them.

Application Layer

Collective operations in an AI cluster are performed by a Collective Communication Library (CCL). In HPC clusters, a similar role is performed by MPI; however, this document focuses on CCLs, whose terminology may differ.


Figure 1. CCL Layer

Let’s start with a short summary of the Benchmarking Collective Operations post. A CCL’s job is to perform a Collective Operation (AllReduce, etc.) on data that resides in the memory of multiple GPUs. Each CCL instance has an identifier in a cluster, called a Rank. The total number of ranks participating in the collective operation is called the Collective Size N. The amount of data in the memory of a rank is called the Data Size S. In HPC/MPI applications, that amount is called the Message Size, but in network engineering for AI clusters such a term quickly becomes overloaded, as you will see in the transport section below.

The time it takes for a collective operation to complete should be called the Collective Completion Time (CCT), to be very specific and to differentiate it from the larger Job Completion Time, as an AI job includes computation as well as multiple communications. We measure CCT across all the ranks as the time difference between the first rank to start and the last rank to finish. The equivalent term in HPC/MPI is latency, but again, in network engineering that term overlaps with packet latency.
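The CCT measurement described above can be sketched in a few lines. The timestamps and rank count below are purely illustrative, not taken from any real trace:

```python
# Hypothetical per-rank (start, finish) timestamps in seconds;
# the values are illustrative, not from a real CCL trace.
rank_times = {
    0: (0.000, 0.012),
    1: (0.001, 0.015),
    2: (0.000, 0.011),
    3: (0.002, 0.018),  # a straggler rank finishing last
}

def collective_completion_time(times):
    """CCT = (last rank to finish) - (first rank to start)."""
    first_start = min(start for start, _ in times.values())
    last_finish = max(finish for _, finish in times.values())
    return last_finish - first_start

print(collective_completion_time(rank_times))  # 0.018
```

Note that CCT is a cluster-wide measurement: it deliberately spans all ranks rather than averaging per-rank durations, so a single straggler dominates the result.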

Depending on the collective algorithm, parts of that data will need to be moved between the ranks in a fashion choreographed by the CCL. For example, for AlltoAll, the data would be divided into N parts, and each part (chunk) of size C = S/N would be sent by the rank to the corresponding destination rank. These parts can be called Algorithmic Rank-Chunks, or simply Rank-Chunks. The number of chunks is determined by the algorithm, and chunks are sequentially numbered on each rank. For AllReduce Ring, chunks of size C = S/N move according to the ring schedule. In these examples, the number of algorithmic rank-chunks per rank is equal to N. This property is algorithm-specific and MUST NOT be assumed for all collective algorithms.
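The AlltoAll partitioning above can be sketched as follows. This is a minimal model of the C = S/N rule, assuming S divides evenly by N; the function name and tuple layout are illustrative, not a CCL API:

```python
def alltoall_rank_chunks(data_size, n_ranks):
    """Partition one rank's data of size S into N algorithmic
    rank-chunks of size C = S / N (assumes S divides evenly).
    Chunk i on each rank is destined for rank i."""
    c = data_size // n_ranks
    # (destination rank, offset within the source buffer, chunk size)
    return [(i, i * c, c) for i in range(n_ranks)]

chunks = alltoall_rank_chunks(data_size=1 << 20, n_ranks=8)
# 8 rank-chunks of 131072 bytes (128 KiB) each
print(chunks[3])  # (3, 393216, 131072)
```

For AllReduce Ring the chunk sizes would be the same, but the destination of each chunk changes per step of the ring schedule rather than being fixed once.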

For operations that perform data reduction, the CCL must not only receive the chunks from other ranks but also use compute (GPU) to perform the reduction function. To improve efficiency, the chunks can be subdivided into smaller, fixed-size slices; as soon as a slice is received, it can be reduced locally and then forwarded to the next rank. This approach is called a data pipeline, and the fixed-size slices are called pipelined slices.

Pipelined Slice: A fixed-size subdivision of an algorithmic rank-chunk introduced by the CCL implementation to improve the operation performance. The slice size is library-specific and can vary depending on the algorithm and other parameters.
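The slicing step can be sketched as below. The 32 KiB slice size is an illustrative placeholder; as the definition says, the real value is library-specific:

```python
def pipeline_slices(chunk_size, slice_size):
    """Split one algorithmic rank-chunk into fixed-size pipelined
    slices; the last slice may be shorter if sizes don't divide evenly.
    The slice size here is illustrative -- real CCLs choose it per
    algorithm and tuning parameters."""
    offsets = range(0, chunk_size, slice_size)
    return [(off, min(slice_size, chunk_size - off)) for off in offsets]

slices = pipeline_slices(chunk_size=131072, slice_size=32768)
# 4 slices of 32 KiB: each can be reduced locally and forwarded to the
# next rank as soon as it arrives, overlapping transfer with compute
print(len(slices))  # 4
```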

Transport Semantics

The CCL application running on each rank needs to copy the slices to the memory of other ranks. It calls into a Transport API (e.g., RDMA) to perform that efficiently using hardware acceleration. The RDMA libraries provide transport verbs that define the semantic layer. A key function of the transport semantics layer is to enable addressing of the memory regions of the receiver so that data ends up in the correct place. The way memory addressing works depends on the library and the underlying transport protocol, but there are common elements among them, including a buffer offset that represents the displacement from the start of the receiver’s collective operation memory buffer.
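The buffer-offset idea can be modeled with a toy memory region. This is not a real RDMA verbs API, just a sketch of offset-based addressing: the receiver exposes one registered buffer for the whole collective, and every transfer lands at base plus offset:

```python
# Toy model of offset-based addressing -- NOT a real RDMA API.
# The receiver registers one memory region for the whole collective
# operation buffer; each incoming transfer lands at base + offset.

class MemoryRegion:
    def __init__(self, size):
        self.buf = bytearray(size)  # the receiver's registered buffer

    def write(self, offset, payload):
        """Place payload at the given displacement from the start of
        the receiver's collective operation memory buffer."""
        self.buf[offset:offset + len(payload)] = payload

region = MemoryRegion(size=1024)
# deliver a 64-byte slice to displacement 256 in the receiver's buffer
region.write(offset=256, payload=b"\x01" * 64)
```

A real one-sided RDMA write carries the equivalent of this offset (plus a remote key authorizing access) so the receiver's CPU is not involved in placing the data.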


Figure 2. Transport Semantics Layer

We define the block of data transmitted by a single transport API call as a Transport Message (an RDMA message in RDMA-based transports). The amount of data transmitted by a single transport message can be called the transport message size (RDMA message size, not to be confused with MPI message size). An important consideration is whether there is a one-to-one mapping between pipelined slices and transport messages. Even though it should be possible to transmit each slice with a single message, there are performance optimization techniques observed in practice that split the slices into multiple messages.
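The slice-to-message mapping can be sketched as a simple split. The message-size cap is an illustrative parameter; actual transports pick it based on their own limits and tuning:

```python
def slice_to_messages(slice_size, max_message_size):
    """Map one pipelined slice onto one or more transport messages.
    One-to-one when the slice fits in a single message; otherwise the
    slice is split, as observed in practice for optimization reasons.
    The cap value is illustrative, not a protocol constant."""
    sizes = []
    remaining = slice_size
    while remaining > 0:
        sizes.append(min(max_message_size, remaining))
        remaining -= sizes[-1]
    return sizes

print(slice_to_messages(32768, 32768))  # [32768] -- one-to-one mapping
print(slice_to_messages(32768, 8192))   # [8192, 8192, 8192, 8192]
```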

Transport Protocol

There are various transport protocols that enable RDMA communications over different types of networks – InfiniBand, RoCEv2, SRD, Falcon, and Ultra Ethernet are among them, and there are more.

Each of them uses its own constructs that serve as connections between sender and receiver NICs. InfiniBand and RoCEv2 use Queue Pairs (QPs), while Ultra Ethernet uses Packet Delivery Contexts (PDCs). QPs are persistent in nature and can be used to transmit multiple messages sequentially, one after another. PDCs, on the other hand, are ephemeral: they lack connection setup and teardown, have a short lifespan, and, depending on the transport mode (RUD, ROD, and so on), enable packet spraying in the fabric by changing the UDP source port on a packet-by-packet basis.

This difference provides an insight into how pipelined slices map onto transport messages. In RoCEv2, each RDMA message is mapped to a specific QPair. Because it is difficult for a network fabric to load-balance the traffic of a small number of high-bandwidth QPairs, it is common to see the slices being split into smaller blocks. That way, a higher number of QPairs can be used, making load balancing easier. It is not yet clear whether similar splitting would benefit Ultra Ethernet, but it should not be ruled out. Therefore, when defining transport metrics, we should accommodate such behavior.
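The QPair-spreading optimization described above can be sketched as a round-robin assignment. The message list and QPair count are illustrative; real CCLs choose these dynamically:

```python
def spread_over_qpairs(messages, n_qpairs):
    """Assign transport messages to QPairs round-robin so the fabric
    sees more distinct 5-tuples to load-balance. Messages on the same
    QPair are serialized; messages on different QPairs may proceed in
    parallel over different network paths."""
    qpairs = [[] for _ in range(n_qpairs)]
    for i, msg in enumerate(messages):
        qpairs[i % n_qpairs].append(msg)
    return qpairs

# e.g. a 128 KiB slice split into 16 messages of 8 KiB each,
# spread over 4 QPairs (all values illustrative)
assignment = spread_over_qpairs(list(range(16)), n_qpairs=4)
print([len(q) for q in assignment])  # [4, 4, 4, 4]
```

This is exactly why a per-QPair view of the transfer is needed when defining flow metrics: the same rank-chunk's data may traverse several distinct network paths concurrently.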


Transport Metrics

Since collective operations require all the ranks to finish before the operation can be considered complete, identifying the root causes of straggler ranks is critical. To characterize the impact of transport performance, we need a good metric to analyze whether the transfer times of different chunks are approximately the same or vary significantly. Such a metric would help us identify stragglers as well as ranks that needlessly take a disproportionate share of bandwidth. Since we’re aware of the optimization techniques that rely on multiple parallel data transfers to overcome load balancing challenges, we need to take that into account – some transfers may take longer than others due to a different path, or set of paths, over the network.

Traditionally, due to in-order delivery requirements, transfers that use the same 5-tuple combination follow the same path. In network engineering, transfers that are mapped onto unique 5-tuples are called network flows. Flow Completion Time, or FCT, is a well-established metric used to describe the duration of such transfers. When RoCEv2 is used as a transport for a collective operation, a network flow can be defined as a sequence of transport messages that transfer data from a single algorithmic rank-chunk over the same QPair.

We recommend using FCT as a key metric that provides insights into rank-chunk transfer performance between different pairs of ranks.
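As a sketch of how FCT supports straggler identification, a simple spread check against the median flags flows that take disproportionately long. The per-flow timings below are made-up illustrative values, and the threshold factor is an arbitrary placeholder:

```python
# Hypothetical per-flow completion times in microseconds between
# (source rank, destination rank) pairs -- values are illustrative.
fcts_us = {
    ("rank0", "rank3"): 120.0,
    ("rank1", "rank3"): 118.0,
    ("rank2", "rank3"): 390.0,  # a potentially congested path
    ("rank4", "rank3"): 121.0,
}

def stragglers(fcts, factor=2.0):
    """Flag flows whose FCT exceeds `factor` times the median FCT.
    The factor is an arbitrary illustrative threshold."""
    values = sorted(fcts.values())
    median = values[len(values) // 2]
    return [pair for pair, t in fcts.items() if t > factor * median]

print(stragglers(fcts_us))  # [('rank2', 'rank3')]
```

In practice one would look at the full FCT distribution (e.g. tail percentiles) rather than a single threshold, but the principle is the same: near-equal FCTs indicate healthy load balancing, and outliers point at straggler paths.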

For emerging protocols like Ultra Ethernet, the one-5-tuple-per-connection principle no longer holds, because the UDP source port is no longer always fixed. To be able to observe the improvements provided by such protocols, we need a definition of a flow that is not tied to a 5-tuple but is still compatible with RoCEv2. At least for benchmarking purposes, a Flow is a sequence of transport messages that transfer data from a single algorithmic rank-chunk in a serialized manner. This definition matches the behavior of RoCEv2 messages serialized over a single QPair, and it also accommodates Ultra Ethernet transfers over one or many PDCs, as long as transport messages belonging to the same algorithmic rank-chunk do not overlap in time with each other. This definition should be practically useful as long as data transfers are serialized.
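Under this definition, computing FCT from a message trace amounts to grouping transport messages by the rank-chunk they carry and taking last-end minus first-start per group. The record format and timestamps below are illustrative, not a real trace schema:

```python
# Illustrative transport-message records: (rank_chunk_id, start_us,
# end_us). Messages carrying the same rank-chunk, serialized in time,
# form one flow under the definition above.
messages = [
    ("chunk-0", 0.0, 10.0),
    ("chunk-0", 10.0, 25.0),
    ("chunk-0", 25.0, 32.0),
    ("chunk-1", 0.0, 40.0),
]

def flow_completion_times(msgs):
    """FCT per flow = (last message end) - (first message start),
    grouping transport messages by the rank-chunk they transfer."""
    bounds = {}
    for chunk, start, end in msgs:
        s, e = bounds.get(chunk, (start, end))
        bounds[chunk] = (min(s, start), max(e, end))
    return {chunk: e - s for chunk, (s, e) in bounds.items()}

print(flow_completion_times(messages))
# {'chunk-0': 32.0, 'chunk-1': 40.0}
```

Note the definition is transport-agnostic: the same computation applies whether the messages were serialized over one RoCEv2 QPair or over a succession of Ultra Ethernet PDCs.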

Summary of Terms and Definitions

| Term | Definition | Metric and Measurement Unit |
|------|------------|-----------------------------|
| Collective Operation | Communication patterns that involve a group of processes exchanging data. Elemental blocks of network communications in scale-out AI/ML clusters that move data between the GPUs. | Broadcast, Gather, Scatter, ReduceScatter, AllGather, AllReduce, AlltoAll |
| Rank | Identifier of an endpoint exchanging messages in the collective operation. | Numbers starting with 0 |
| Data Size, S | Size of the data one rank has as input to the operation. | Bytes (B) |
| Collective Completion Time | Time it takes for a collective operation to complete, measured as the time difference between the first rank to start and the last rank to finish. | CCT, seconds (s) |
| Algorithmic Rank-Chunk | Partitioning of the input data required to move it between the ranks, as defined by the collective operation implementation algorithm. | C, Bytes (B) |
| Rank-Chunk Completion Time | Time it takes for each unique algorithmic rank-chunk transfer to complete. | RCCT, seconds (s) |
| Pipelined Slice | A fixed-size subdivision of an algorithmic rank-chunk introduced by the CCL implementation to improve the operation performance. | Bytes (B) |
| Transport Message | A block of data transmitted by a single transport API call, for example, by an RDMA Verb. | Bytes (B) |
| Flow | A sequence of transport messages that transfer data from a single algorithmic rank-chunk in a serialized manner. | Bytes (B) |
| Flow Completion Time | Time it takes for each unique flow to complete. | FCT, microseconds (µs) |