traffic control for rdma - survey

how different traffic control schemes work in traditional/rdma-enabled DCNs.

Introduction

  1. To achieve low latency and high throughput, traditional cc schemes have been applied, such as DCTCP, TIMELY, pFabric, and QJUMP.

  2. To meet higher performance demands, RDMA (remote direct memory access) is a promising solution. It provides fast data transfer by bypassing kernel processing in the operating system and exchanging data directly between NICs and application memory, which reduces CPU overhead.

  3. RDMA was originally designed for supercomputing networks (e.g. InfiniBand) rather than Ethernet/IP based DCNs. RDMA over Converged Ethernet (RoCE) and the Internet Wide Area RDMA Protocol (iWARP) have been proposed to enable RDMA over Eth/IP networks. RoCEv2 depends on priority flow control (PFC) to achieve lossless Ethernet, but PFC has many undesirable side effects. iWARP also results in low performance. Most existing traffic control schemes for DCNs are not compatible with RDMA since they are TCP/IP based.

    classification of traffic control schemes

how RDMA works

  1. RDMA uses its own API to send data from the application directly to the RDMA NIC, instead of going through sockets and the TCP/IP stack in the kernel.
  2. RDMA uses three queues.

    • send queue (SQ)
    • receive queue (RQ)
    • completion queue (CQ)
  3. When an RDMA application starts, it must create its three queues on the NIC and register the memory regions it will use. The work scheduling unit of RDMA is a queue pair (QP), which consists of one SQ and one RQ. Different QPs can traverse different paths.

  4. In the QP, a work queue element (WQE) is placed as an instruction pointing to the memory buffer for the data, and the NIC executes the WQE without involving the kernel.
  5. A CQ is used to notify the application when the transmission is done. A completion queue element (CQE) is inserted into the CQ when a WQE completes.
  6. RDMA supports two kinds of transmission semantics for WQEs:
    • channel semantic, SEND & RECV verbs
    • memory semantic, READ & WRITE verbs
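The queue model above can be sketched as a toy simulation. This is not the real verbs API: `WQE`, `CQE`, `ToyQueuePair`, and `toy_nic_process` are hypothetical names chosen for illustration, and the "NIC" is just a Python function that copies bytes between registered buffers.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WQE:
    """Work queue element: an instruction pointing at a registered buffer."""
    wr_id: int
    opcode: str          # "SEND" or "RECV" in this toy (channel semantic only)
    buffer: bytearray


@dataclass
class CQE:
    """Completion queue element reported back to the application."""
    wr_id: int
    status: str


class ToyQueuePair:
    """One send queue (SQ) plus one receive queue (RQ), reporting to a CQ."""

    def __init__(self, cq):
        self.sq = deque()
        self.rq = deque()
        self.cq = cq

    def post_send(self, wqe):
        self.sq.append(wqe)

    def post_recv(self, wqe):
        self.rq.append(wqe)


def toy_nic_process(sender, receiver):
    """Execute one SEND WQE the way a NIC would, with no kernel involved."""
    send_wqe = sender.sq.popleft()
    recv_wqe = receiver.rq.popleft()        # a SEND consumes a posted RECV
    n = len(send_wqe.buffer)
    recv_wqe.buffer[:n] = send_wqe.buffer   # data lands in the registered buffer
    sender.cq.append(CQE(send_wqe.wr_id, "success"))
    receiver.cq.append(CQE(recv_wqe.wr_id, "success"))


# Usage: the receiver posts a RECV first, then the sender posts a SEND.
cq_a, cq_b = deque(), deque()
qp_a, qp_b = ToyQueuePair(cq_a), ToyQueuePair(cq_b)
recv_buf = bytearray(8)
qp_b.post_recv(WQE(wr_id=1, opcode="RECV", buffer=recv_buf))
qp_a.post_send(WQE(wr_id=7, opcode="SEND", buffer=bytearray(b"hello")))
toy_nic_process(qp_a, qp_b)
```

The memory semantic (READ & WRITE) would differ in that the remote side posts no RECV WQE; that case is left out of the sketch.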

traffic control in traditional RDMA

RoCEv2

With RoCEv2, an RDMA packet is encapsulated in an Eth/IP/UDP packet. RoCE handles two scenarios to realize lossless network transmission:

  • go-back-x retransmission for handling out-of-order packets. When it receives an out-of-order packet, the receiver’s NIC discards the packet and asks the sender to retransmit either all the data (go-back-0) or all packets after the last acknowledged one (go-back-N). This wastes time and bandwidth on redundant packets.
  • PFC for preventing packet loss. PFC is a hop-by-hop flow control mechanism that prevents buffer overflow on switches and NICs. It works at queue granularity and sends PAUSE/RESUME frames from downstream to upstream. A downstream device monitors its ingress queues and sends a PAUSE frame when a queue reaches the PFC threshold and a RESUME frame when it falls below it. The upstream device stops or resumes transmission accordingly. This causes a) unfairness between different flows (e.g. when several flows share the same queue) and b) head-of-line blocking.
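The PAUSE/RESUME behaviour can be sketched as a small state machine on one ingress queue. The class name `ToyPfcIngressQueue` and the two separate thresholds `xoff` and `xon` are assumptions made for this sketch (the notes above mention a single PFC threshold); depths are counted in packets for simplicity.

```python
class ToyPfcIngressQueue:
    """Ingress queue of a downstream device that emits PFC frames upstream."""

    def __init__(self, xoff, xon):
        assert xon < xoff, "resume threshold must sit below pause threshold"
        self.xoff = xoff        # depth at which PAUSE is sent (assumed name)
        self.xon = xon          # depth at which RESUME is sent (assumed name)
        self.depth = 0
        self.paused = False

    def _check(self):
        """Return the frame to send upstream, or None if nothing changes."""
        if not self.paused and self.depth >= self.xoff:
            self.paused = True
            return "PAUSE"
        if self.paused and self.depth <= self.xon:
            self.paused = False
            return "RESUME"
        return None

    def enqueue(self, n=1):
        self.depth += n
        return self._check()

    def dequeue(self, n=1):
        self.depth = max(0, self.depth - n)
        return self._check()


# Usage: the queue fills to the pause threshold, then drains to the resume one.
q = ToyPfcIngressQueue(xoff=4, xon=2)
fill_events = [q.enqueue() for _ in range(5)]
drain_events = [q.dequeue() for _ in range(3)]
```

Because the PAUSE applies to the whole queue, every flow mapped to that queue is stopped, including flows that are not heading to the congested port. That is the head-of-line blocking noted above.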

Most existing works that try to solve these problems focus on modifying the NIC.

iWARP

iWARP offloads the whole TCP/IP stack to the NIC.

Compared to a RoCE NIC, an iWARP NIC requires more protocol layers, i.e. marker protocol data unit aligned framing (MPA), direct data placement (DDP), and a separate RDMA protocol (RDMAP). This increases the processing cost on the NIC and lowers performance.

traffic control in traditional DCNs

requirement

  • delay-sensitive: usually small flows that require low latency and high burst tolerance
  • throughput-sensitive: typically large flows with high throughput demand
  1. fine-grained: different flows should be treated differently
  2. fast congestion detection
  3. quick start (small flows should start at full speed)
  4. low CPU consumption

explicit congestion notification (ECN) based

DCTCP

motivation

ECN can prevent overflowing the buffer, thus reducing packet loss.

algorithm

When the queue occupancy reaches the threshold, the switch marks subsequent packets with the congestion encountered (CE) bit in the IP header. Upon receiving a CE-marked packet, the receiver sets the ECN-Echo (ECE) bit in the corresponding ACK.

  • conventional ECN: the sender halves cwnd.
  • DCTCP: every RTT the sender calculates the fraction of marked packets among all packets, and decreases cwnd in proportion to that fraction.
  • other DCTCP-like variants: differentiate flows according to their priority
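The difference between the first two reactions can be sketched as below. The function names are hypothetical, and the gain `g = 1/16` is an illustrative default rather than a value taken from these notes. The sketch models only the once-per-RTT decrease, not window growth.

```python
def conventional_ecn_update(cwnd, marked):
    """Classic ECN reaction: halve cwnd if any packet was marked this RTT."""
    return max(1.0, cwnd / 2) if marked else cwnd


def dctcp_update(cwnd, alpha, marked, total, g=1 / 16):
    """One per-RTT DCTCP step.

    alpha is a moving average of the fraction of CE-marked packets;
    cwnd shrinks in proportion to alpha instead of always halving.
    """
    frac = marked / total if total else 0.0
    alpha = (1 - g) * alpha + g * frac
    if marked:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))
    return cwnd, alpha


# Usage: heavy congestion (all packets marked, alpha already 1) halves cwnd,
# mild congestion (1 of 10 marked, alpha starting at 0) barely reduces it.
heavy_cwnd, heavy_alpha = dctcp_update(100.0, 1.0, marked=10, total=10)
mild_cwnd, mild_alpha = dctcp_update(100.0, 0.0, marked=1, total=10)
```

With persistent full marking DCTCP behaves like conventional ECN, while light marking produces only a small window cut, which is how it keeps queues short without giving up throughput.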

limitation

  1. ECN-supported switches needed

  2. relies on TCP/IP stack

  3. slow start phase
  4. relies on congestion severity $\alpha$

delay based

TIMELY

motivation

  • ECN-based schemes only indicate that the queue at a single switch exceeds the threshold; they cannot tell whether multiple switches are congested simultaneously.
  • ECN does not detect NIC congestion
  • packets already experience queueing delay before ECN is triggered, especially for low-priority flows
  • RTT is a good measure of congestion

algorithm

adjust the sending rate according to the gradient of RTT
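A simplified sketch of a gradient-based update in the spirit of TIMELY follows. The structure (low and high RTT thresholds, a smoothed RTT difference normalized by the minimum RTT) is how the scheme is usually described; the names, the default parameter values, and the final rate floor are assumptions for illustration, and the hyperactive increase mode is omitted.

```python
from dataclasses import dataclass


@dataclass
class TimelyState:
    rate: float             # current sending rate, e.g. in Mbps
    prev_rtt: float         # previous RTT sample, e.g. in microseconds
    rtt_diff: float = 0.0   # smoothed difference between consecutive RTTs


def timely_update(s, new_rtt, min_rtt, t_low, t_high,
                  ewma=0.875, beta=0.8, delta=10.0):
    """Apply one RTT sample to the sender state and return the new rate."""
    new_diff = new_rtt - s.prev_rtt
    s.prev_rtt = new_rtt
    s.rtt_diff = (1 - ewma) * s.rtt_diff + ewma * new_diff
    gradient = s.rtt_diff / min_rtt          # normalized RTT gradient

    if new_rtt < t_low:                      # queue nearly empty: just probe
        s.rate += delta
    elif new_rtt > t_high:                   # delay too high: cut regardless
        s.rate *= 1 - beta * (1 - t_high / new_rtt)
    elif gradient <= 0:                      # RTT flat or falling: increase
        s.rate += delta
    else:                                    # RTT rising: multiplicative cut
        s.rate *= 1 - beta * gradient

    s.rate = max(s.rate, delta)              # assumed floor, keeps rate > 0
    return s.rate


# Usage: RTT rises (rate is cut), falls (rate grows), then exceeds t_high.
st = TimelyState(rate=1000.0, prev_rtt=100.0)
r1 = timely_update(st, 120.0, min_rtt=20.0, t_low=50.0, t_high=500.0)
r2 = timely_update(st, 110.0, min_rtt=20.0, t_low=50.0, t_high=500.0)
r3 = timely_update(st, 600.0, min_rtt=20.0, t_low=50.0, t_high=500.0)
```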

limitation

  • cannot simultaneously achieve fairness and a guaranteed steady-state delay
  • TIMELY also needs PFC

traffic control in RDMA-Enabled DCNs

requirements

besides the requirements mentioned before, efficient RDMA deployment in DCNs adds the following requirements:

  1. low NIC resource consumption: RDMA is implemented completely on the NIC
  2. easy configuration and implementation (on NICs)
  3. high performance: should not depend on PFC & go-back-N

DCQCN

DCQCN combines DCTCP & quantized congestion notification (QCN).

the switch marks packets with ECN (CE bit) -> the receiver’s NIC periodically sends a congestion notification packet (CNP), defined by RoCE, back to the sender -> the sender reacts to the echoed congestion notification by adjusting its rate

The sender adjusts the rate using $R_c$, the current sending rate, and $R_t$, the target rate, which is the sending rate just before the last congestion feedback message arrived at the sender.

Upon getting a CNP, the sender records the current rate as the target, $R_t \leftarrow R_c$, and then reduces $R_c$ in the same way DCTCP reduces cwnd. Rate increase is not driven by ACKs as in TCP. In the fast recovery phase, $R_c \leftarrow (R_c + R_t)/2$ while $R_t$ is kept unchanged. In the additive increase phase, to quickly probe for bandwidth, $R_c \leftarrow (R_c + R_t)/2$ and $R_t \leftarrow R_t + R_{AI}$, where $R_{AI}$ is a constant.
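The sender-side rules can be sketched as follows. The class `DcqcnSender`, the congestion estimate `alpha` with gain `g`, and the default values are assumptions for illustration. In `additive_increase` the target is raised first and the current rate then moves halfway toward it, which is one possible ordering of the two updates.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    rc: float             # current rate R_c
    rt: float             # target rate R_t
    alpha: float = 1.0    # congestion estimate, as in DCTCP
    g: float = 1 / 256    # gain for alpha (illustrative value)
    r_ai: float = 40.0    # additive increase step R_AI (illustrative value)

    def on_cnp(self):
        """Rate decrease when a CNP arrives."""
        self.rt = self.rc                     # remember rate before the cut
        self.rc *= 1 - self.alpha / 2         # DCTCP-like reduction
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """No CNP seen for a while: let the congestion estimate decay."""
        self.alpha *= 1 - self.g

    def fast_recovery(self):
        """Move halfway back to the target; the target stays unchanged."""
        self.rc = (self.rc + self.rt) / 2

    def additive_increase(self):
        """Raise the target by R_AI, then move halfway toward it."""
        self.rt += self.r_ai
        self.rc = (self.rc + self.rt) / 2


# Usage: one CNP, one fast recovery step, one additive increase step.
snd = DcqcnSender(rc=1000.0, rt=1000.0)
snd.on_cnp()
after_cnp = (snd.rc, snd.rt)
snd.fast_recovery()
after_fr = (snd.rc, snd.rt)
snd.additive_increase()
after_ai = (snd.rc, snd.rt)
```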

limitation

Though it alleviates PFC’s limitations, DCQCN still needs PFC. Careful parameter configuration is needed to reduce how often PFC is triggered.

Multipath-RDMA

MP-RDMA uses an FPGA to achieve multi-path transmission while avoiding frequent swapping of data between on-chip memory and host memory over PCIe.

A UDP source port can be viewed as a virtual path: a switch uses ECMP to pick the path for a flow based on the UDP source port. Upon receiving a packet, the receiver returns an MP-RDMA ACK with the virtual path ID encoded in the UDP header.

MP-RDMA uses a single cwnd, which is spread across the different paths. When an ACK carries an ECN mark, cwnd is decreased by 1/2 segment, thus halving the rate of the congested path. Otherwise, cwnd is increased by 1/cwnd segment.
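The per-ACK window rule can be written in a few lines. The function name and the floor of one segment are assumptions made for this sketch.

```python
def mp_rdma_on_ack(cwnd, ecn_marked):
    """Update the shared congestion window (in segments) for one ACK."""
    if ecn_marked:
        return max(1.0, cwnd - 0.5)   # cut half a segment per marked ACK
    return cwnd + 1.0 / cwnd          # grow by 1/cwnd per unmarked ACK


# Usage: one marked ACK, one unmarked ACK, then a full window of unmarked ACKs.
after_mark = mp_rdma_on_ack(10.0, True)
after_clean = mp_rdma_on_ack(4.0, False)

cwnd = 8.0
for _ in range(8):                    # roughly one RTT worth of ACKs
    cwnd = mp_rdma_on_ack(cwnd, False)
```

A full window of unmarked ACKs grows cwnd by a little under one segment, while every marked ACK removes half a segment, so a path whose ACKs are all marked loses about half of the window share it was using.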

The receiver’s NIC uses a bitmap to track out-of-order packets. Each slot has 2 bits and four states: empty, received, tail, and tail with completion; this state is encapsulated in each packet. Only when the last slot is tail or tail with completion has the message been completely transmitted, and the head pointer is then moved.
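One possible reading of this bitmap is sketched below. It is not the hardware design: `ToyReorderBitmap`, the use of a Python list instead of packed 2-bit fields, and the policy of dropping packets outside the window are all assumptions for illustration. The head pointer moves only when every slot from the head up to a tail slot has been filled.

```python
EMPTY, RECEIVED, TAIL, TAIL_WITH_COMPLETION = range(4)   # four 2-bit states


class ToyReorderBitmap:
    """Receiver-side tracking window indexed by packet sequence number."""

    def __init__(self, size):
        self.slots = [EMPTY] * size
        self.head_psn = 0        # sequence number that slot 0 stands for
        self.completed = 0       # number of fully received messages

    def on_packet(self, psn, state=RECEIVED):
        """Record an arrival; return False if it falls outside the window."""
        offset = psn - self.head_psn
        if not 0 <= offset < len(self.slots):
            return False
        self.slots[offset] = state
        self._advance()
        return True

    def _advance(self):
        """Move the head past each message whose packets have all arrived."""
        i = 0
        while i < len(self.slots) and self.slots[i] != EMPTY:
            if self.slots[i] in (TAIL, TAIL_WITH_COMPLETION):
                n = i + 1
                self.slots = self.slots[n:] + [EMPTY] * n
                self.head_psn += n
                self.completed += 1
                i = 0            # a following message may be complete too
            else:
                i += 1


# Usage: a three-packet message (PSN 0..2) arrives in the order 2, 0, 1.
bm = ToyReorderBitmap(size=8)
bm.on_packet(2, TAIL)
head_after_tail = bm.head_psn
bm.on_packet(0)
bm.on_packet(1)
too_far = bm.on_packet(20)
```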

limitation

an FPGA is needed, and it still relies on PFC.

Improved RoCE NIC (IRN)

Similar to iWARP, IRN handles packet loss in NIC hardware, as a TCP stack would, so that it can be deployed in lossy networks and PFC can be removed.

SACK is adopted so that only lost packets are retransmitted. The BDP is calculated by the sender’s NIC and is used to bound the number of inflight packets.
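The two ideas can be sketched with a few helper functions. All names are hypothetical, the link speed, RTT, and MTU values are illustrative, and the go-back-N helper is included only to contrast with selective retransmission.

```python
def bdp_cap_packets(link_gbps, rtt_us, mtu_bytes):
    """Bandwidth-delay product expressed as a packet count (rounded up)."""
    bdp_bits = link_gbps * 1000 * rtt_us        # Gbit/s times us = 1000 bits
    return -(-bdp_bits // (8 * mtu_bytes))      # integer ceiling division


def can_send(inflight, cap):
    """The sender may transmit only while inflight packets stay under the cap."""
    return inflight < cap


def go_back_n_retransmit(sent, lost):
    """Resend everything from the first lost packet onward."""
    first = min(lost)
    return [p for p in sent if p >= first]


def selective_retransmit(sent, lost):
    """Resend only the packets reported missing (SACK-style)."""
    return [p for p in sent if p in lost]


# Usage: 40 Gbps link, 10 us RTT, 1000-byte packets; packets 3 and 7 are lost.
cap = bdp_cap_packets(link_gbps=40, rtt_us=10, mtu_bytes=1000)
sent = list(range(10))
lost = {3, 7}
gbn = go_back_n_retransmit(sent, lost)
sel = selective_retransmit(sent, lost)
```

In this example go-back-N resends seven packets where selective retransmission resends two, which is the bandwidth saving that motivates SACK.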

limitation

new features have to be added to the NIC, and complex scenarios need to be considered.

comparison