HPCC leverages in-network telemetry (INT) to control traffic precisely.
Introduction
HPCC:
- adapts quickly
- keeps queues nearly empty to maintain extremely low latency
- fewer parameters to tune for fairness & efficiency
motivation
Large-scale RDMA deployments:
Clos topology with three layers: ToR (top of rack, edge), Agg, Core.
Each PoD (point of delivery) is an independent RDMA domain. Different PoDs are interconnected by Core switches.
DCQCN, PFC, and go-back-N are adopted.
target
- fast convergence
- close-to-empty queue
- few parameters
- fairness
- easy to deploy
some considerations:
- Compared to TCP/IP, RDMA is at higher risk, since RDMA flows start at line rate.
- PFC has destructive impacts: 1. pause storms; 2. deadlocks; 3. suppressing a large number of innocent senders.
- DCQCN has a large number of parameters that need to be tuned carefully, and there are trade-offs like throughput vs. stability, bandwidth vs. latency, and more.
Fortunately, INT has been enabled on many new switches, and programmable NICs are becoming capable enough.
design
INT exposes some metadata at each switch hop: ts (timestamp), qlen (queue length), txBytes (cumulative transmitted bytes), and B (link bandwidth).
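As a rough sketch, the per-hop metadata could be modeled as below (field names are illustrative, not the exact INT header layout):

```python
from dataclasses import dataclass

@dataclass
class IntHop:
    """One hop's INT metadata carried in a packet (illustrative field names)."""
    ts: float         # switch timestamp when the packet left the egress port (s)
    qlen: int         # egress queue length at that moment (bytes)
    tx_bytes: int     # cumulative bytes transmitted on the egress port
    bandwidth: float  # link bandwidth B (bytes per second)
```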
HPCC is window-based and controls inflight bytes: keeping the inflight bytes below the BDP means the queue remains nearly empty.
The starting window is $W_{init}=B_{NIC}\times T$, where $B_{NIC}$ is the NIC bandwidth and T is the base RTT. The sending rate is paced to avoid bursts, with pacing rate $R=W/T$. Note that in a DCN, the base RTTs of different flows are similar.
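A minimal sketch of the initialization and pacing, with illustrative numbers (100 Gbps NIC, 10 µs base RTT) that are not taken from the paper:

```python
# Illustrative numbers (100 Gbps NIC, 10 us base RTT), not from the paper.
B_NIC = 100e9 / 8   # NIC bandwidth in bytes/s
T = 10e-6           # base RTT in seconds

W_init = B_NIC * T  # initial window = one BDP worth of bytes

def pacing_rate(W: float, base_rtt: float = T) -> float:
    """Pacing rate R = W / T, so the window is spread evenly over one RTT."""
    return W / base_rtt

print(W_init)               # 125000.0 bytes (one BDP)
print(pacing_rate(W_init))  # ~1.25e10 bytes/s, i.e. line rate at start
```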
HPCC uses the INT information to adjust the window: multiplicative increase/decrease (MIMD) to converge quickly, plus a small additive increase (AI) to keep fairness.
For a link j, we have $I_j = \sum_i W_i$, where $W_i$ is the i-th flow's window. We need to keep $I_j \leq \eta \times B\times T$, where $\eta$ is the target utilization of the link, usually 95%.
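A quick worked example with illustrative numbers (not from the paper): for a 100 Gbps link with $T = 10\,\mu s$ and $\eta = 95\%$, the inflight budget is $\eta \times B \times T = 0.95 \times 12.5\,\text{GB/s} \times 10\,\mu s \approx 118.75\,\text{KB}$, so the windows of all flows crossing link j should sum to at most roughly 119 KB.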
We can estimate the link usage from the INT information: $I_j = qlen + txRate \times T$, where $txRate = \frac{ack_1.txBytes-ack_0.txBytes}{ack_1.ts-ack_0.ts}$. This estimation assumes the RTTs of all flows are the same.
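A minimal sketch of this estimation, reusing the IntHop fields sketched above for two consecutive ACKs' INT records of the same hop ($ack_0$, $ack_1$); the function names are mine, not the paper's:

```python
def estimate_inflight(prev: IntHop, curr: IntHop, T: float) -> float:
    """I_j = qlen + txRate * T, estimated from two INT samples of one link."""
    tx_rate = (curr.tx_bytes - prev.tx_bytes) / (curr.ts - prev.ts)
    return curr.qlen + tx_rate * T

def normalized_inflight(prev: IntHop, curr: IntHop, T: float) -> float:
    """U_j = I_j / (B * T): >1 means link j holds more than its BDP in flight."""
    return estimate_inflight(prev, curr, T) / (curr.bandwidth * T)
```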
In this way, we can find the bottleneck link. Each sender multiplicatively adjusts its window by the factor $k_j=\frac{I_j}{\eta \times B\times T}=U_j/\eta$, where $U_j$ is the normalized inflight bytes of link j.
Sender i reacts to the most congested link on its path: $W_i = \frac{W_i}{\max_j(U_j)/\eta} + W_{AI}$.
$W_{AI}$ is the additive-increase part, which is very small: $W_{AI} = W_{init} \times (1-\eta)/N$, where N is the expected maximum number of concurrent flows on a link.
To prevent overreaction within a single RTT, HPCC uses a reference window size $W_i^c$, which is updated once per RTT.
As shown in the algorithm, there are only three parameters to set: $\eta$, maxStage, and $W_{AI}$. $\eta$ controls the tradeoff between utilization and transient queue length. maxStage controls the tradeoff between steady-state stability and the speed of reclaiming free bandwidth; it is usually 5. $W_{AI}$ controls the tradeoff between the maximum number of concurrent flows a link can support with near-zero queues and the speed of convergence to fairness.
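Putting the pieces together, here is a compact sketch of the sender-side update under the notation above (per-ACK update, reference window $W^c$ refreshed once per RTT, maxStage bounding consecutive additive increases). It paraphrases the paper's algorithm rather than reproducing it exactly, and the constants are illustrative:

```python
ETA = 0.95              # target link utilization
MAX_STAGE = 5           # max consecutive AI-only rounds before forcing MI/MD
T = 10e-6               # base RTT in seconds (illustrative)
N_MAX = 100             # expected max concurrent flows on a link (illustrative)

W_INIT = (100e9 / 8) * T             # one BDP for a 100 Gbps NIC, in bytes
W_AI = W_INIT * (1 - ETA) / N_MAX    # tiny additive-increase step

class HpccSender:
    """Sketch of HPCC's window update; not the paper's exact pseudocode."""

    def __init__(self):
        self.Wc = W_INIT     # reference window, updated once per RTT
        self.W = W_INIT      # current window
        self.inc_stage = 0   # consecutive additive-increase rounds

    def on_ack(self, max_u: float, first_ack_of_rtt: bool):
        """max_u: largest normalized inflight bytes U_j along the path."""
        if max_u >= ETA or self.inc_stage >= MAX_STAGE:
            # Most congested link is at/above target: multiplicative adjustment
            # by U/eta, plus a small AI term for fairness.
            self.W = self.Wc / (max_u / ETA) + W_AI
            if first_ack_of_rtt:
                self.inc_stage = 0
                self.Wc = self.W
        else:
            # Path is underutilized: additive increase only, bounded by MAX_STAGE.
            self.W = self.Wc + W_AI
            if first_ack_of_rtt:
                self.inc_stage += 1
                self.Wc = self.W
        return self.W, self.W / T    # new window and pacing rate R = W / T
```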