Google is so great! Swift is a delay-based congestion control mechanism deployed in Google's data centers, which aims to maintain extremely low latency and near-zero packet drops.
Its key ideas include: accurate RTT measurement; a simple AIMD mechanism; decomposing latency into fabric and end-host components so it works in both IOPS-intensive and throughput-intensive scenarios; pacing that allows cwnd to fall below 1 to avoid congestion in incast scenarios; and flow and topology scaling to adapt to various topologies and flow counts.
This work was published at SIGCOMM 2020.
Brilliant!
background & motivation
Swift evolved from Timely. It uses simple but accurate RTT measurement to understand congestion and queueing conditions.
Datacenter workloads vary a lot. Each storage medium has its own access time, from 100 ns (DRAM) to 10 ms (HDD). Faster media demand tighter congestion control, especially on the fabric, while IO-intensive workloads spend more time on end-host processing, so the advertised window is a less effective signal there.
Nowadays network stacks are usually implemented in user space or offloaded to the NIC. Swift runs in Snap (also published at SIGCOMM 2020 by Google). Swift's delay-based scheme is also used to control the rate of RDMA operations, based on precise timestamp measurements, in the 1RMA system.
Production environments contain several generations of switches, so congestion control that relies on switch features can be hard to deploy; ECN marking thresholds are also hard to tune.
design
outlook
One of the biggest problems with current end-host schemes such as DCTCP is that they struggle to handle large-scale incasts and congestion at the host (i.e., under IO-intensive workloads).
These high-level requirements guided the evolution of Swift:
(1) Provide low, tightly-bound network latency, near zero loss, and high throughput while scaling to a large datacenter across a range of workloads.
(2) Provide end-to-end congestion-control that manages congestion not only in the network fabric but also in the NIC, and on hosts, i.e., in software stacks. We call the latter endpoint (or host) congestion in this paper.
(3) Be highly CPU-efficient so as to not compromise an otherwise CPU-efficient OS bypass communication.
In detail, Swift adopts an AIMD algorithm with the goal of keeping delay around a target value. Swift decomposes the end-to-end RTT into NIC-to-NIC (fabric) delay and endpoint delay. The former is the forward and ACK-return transit time, including switch queueing and processing. The latter is the sum of the remote NIC Rx delay, processing time, and NIC Tx delay.
time measurement
Time measurement relies on timestamps taken on the NIC and in software. Swift is implemented in Pony Express, a networking stack in Snap.
When a packet is sent, the stack records the send time $t_1$. The remote NIC marks $t_2$ when the packet lands on it. When an ACK is ready to be sent at $t_4$, the remote Rx queueing + processing time, $t_4 - t_2$, is appended to the ACK. The hop count (computed from the remaining TTL) is also appended, for topology scaling. The overall end-to-end RTT is $t_6 - t_1$, where $t_6$ is the time the stack processes the ACK.
Swift disables cumulative ACKs to keep RTT measurements accurate, though some batching still happens since several packets may be processed in the same batch.
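The decomposition above can be sketched as follows. This is a minimal sketch: the function name and the exact set of delay components are my assumptions, not Pony Express's actual API.

```python
def decompose_rtt(t1, t6, remote_delay, local_rx_delay):
    """Split end-to-end RTT into fabric and endpoint components.

    t1             - stack-recorded send time of the data packet
    t6             - stack time when the ACK is processed
    remote_delay   - t4 - t2, the remote Rx queueing + processing time
                     echoed back in the ACK
    local_rx_delay - local NIC-to-stack delay for the ACK
                     (assumed measurable from NIC timestamps)
    """
    rtt = t6 - t1
    endpoint_delay = remote_delay + local_rx_delay
    fabric_delay = rtt - endpoint_delay  # NIC-to-NIC transit incl. switch queueing
    return rtt, fabric_delay, endpoint_delay
```

For example, with $t_1 = 0$, $t_6 = 100$, a remote delay of 30, and a local Rx delay of 5 (all in microseconds), the fabric delay comes out as 65 µs.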
Swift extends the advertised window into an end-host congestion window, ecwnd. The effective congestion window is thus the minimum of fcwnd and ecwnd, each adjusted against its own target delay.
incast
In an incast, when the number of flows exceeds the path BDP, even a congestion window of one packet is too high to prevent overload. To handle such cases, Swift allows the congestion window to fall below one packet, down to a minimum of 0.001 packets, meaning a packet is sent only once every several RTTs.
Pacing is not always preferable to an ACK-clocked window, since it is less CPU-efficient. Pacing is therefore enabled only when cwnd falls below 1; otherwise ACK clocking is used.
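A sketch of that switching rule (the function name and return convention are mine, not the paper's):

```python
def send_mode(cwnd, rtt):
    """Choose between ACK clocking and pacing, per the rule above.

    With cwnd >= 1, stay ACK-clocked (cheaper on CPU). Once cwnd drops
    below one packet, pace one packet every rtt / cwnd.
    """
    if cwnd >= 1.0:
        return "ack-clocked", 0.0
    return "paced", rtt / cwnd  # e.g. cwnd = 0.5 -> one packet per 2 RTTs
```

With cwnd = 0.5 and RTT = 100 µs, this paces one packet every 200 µs, i.e., one packet every two RTTs.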
scaling
topo scaling is simple: the more hops, the higher the target delay.
flow scaling is based on the observation that the average queue length grows as $O(\sqrt N)$ with $N$ concurrent flows. However, the sender does not know the number of flows. When $N$ flows share a link, cwnd is inversely proportional to $N$, so the flow-scaling part of the target delay should be proportional to $\sqrt N$, i.e., to $1/\sqrt{cwnd}$.
The overall target delay is:

$$t = base\_target + \#hops \times h + \max\left(0,\ \min\left(\frac{\alpha}{\sqrt{cwnd}} + \beta,\ fs\_range\right)\right),$$

where $\alpha = \frac{fs\_range}{\frac{1}{\sqrt{fs\_min\_cwnd}}-\frac{1}{\sqrt{fs\_max\_cwnd}}}$ and $\beta = -\frac{\alpha}{\sqrt{fs\_max\_cwnd}}$.
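The scaled target can be sketched in Python as below; parameter names follow the symbols above, the per-hop term is my reading of topology scaling, and the example values are made up.

```python
import math

def target_delay(base_target, hops, per_hop_delay, cwnd,
                 fs_range, fs_min_cwnd, fs_max_cwnd):
    """Target delay with topology scaling (hops term) and flow scaling
    (the alpha/sqrt(cwnd) + beta term, clamped to [0, fs_range])."""
    alpha = fs_range / (1.0 / math.sqrt(fs_min_cwnd)
                        - 1.0 / math.sqrt(fs_max_cwnd))
    beta = -alpha / math.sqrt(fs_max_cwnd)
    flow_term = min(max(alpha / math.sqrt(cwnd) + beta, 0.0), fs_range)
    return base_target + hops * per_hop_delay + flow_term
```

By construction, the flow-scaling term is exactly 0 at $cwnd = fs\_max\_cwnd$ and exactly $fs\_range$ at $cwnd = fs\_min\_cwnd$, so the added delay always stays within the configured band.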
The curve is shown in the graph below. A steep curve improves fairness and convergence by making slower flows increase their rates quickly, but it also increases network queueing.
Swift uses SACK and a retransmission timer to detect lost packets. Swift flows should use a separate QoS queue.
The overall algorithm:
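Roughly, the per-ACK update looks like the sketch below. The constants (ai, beta, max_mdf) and the omission of the once-per-RTT decrease guard are my simplifications, not the paper's exact parameterization. The same update is applied separately to fcwnd against the fabric target and to ecwnd against the endpoint target, and the sender uses min(fcwnd, ecwnd).

```python
def on_ack(cwnd, delay, target, num_acked,
           ai=1.0, beta=0.8, max_mdf=0.5,
           min_cwnd=0.001, max_cwnd=1000.0):
    """One AIMD step: additive increase below the target delay,
    multiplicative decrease (capped) above it."""
    if delay < target:
        if cwnd >= 1.0:
            cwnd += (ai / cwnd) * num_acked  # ~ +ai per RTT in aggregate
        else:
            cwnd += ai * num_acked
    else:
        # decrease scaled by the overshoot, but never by more than max_mdf;
        # the real algorithm applies a decrease at most once per RTT
        cwnd *= max(1.0 - beta * (delay - target) / delay,
                    1.0 - max_mdf)
    return max(min_cwnd, min(cwnd, max_cwnd))
```

For example, at cwnd = 10 with delay below target, one ACK nudges the window to 10.1; with delay at twice the target, the window is cut to 6.0 in one step.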
measurement
this part is abbreviated here.
compared with GCN (a modified version of DCTCP):
- Swift achieves low loss even at line rate
- Swift achieves low latency, near the target
- Swift achieves near-line-rate throughput while maintaining its promise of low loss and low latency
- Swift controls queueing much better than GCN even though it has less-preferred access to link bandwidth
- Swift can distinguish fabric and end-host delay
- …
the evaluation focuses on
- the relationship between loss rate and port utilization / queue utilization / port speed
- the CDF of RTT (can it reach the target RTT in different situations?)
- fcwnd & ecwnd under different workloads
- whether a higher target delay yields higher throughput
- the effect of pacing: throughput / RTT / loss rate in large-scale incast
- fairness (plus the effect of topology scaling)
in the end
using sojourn time or INT might yield better results.
finally,
GOOGLE IS GOD!!!