This idea and project are based on: Jay, Nathan, et al. "A Deep Reinforcement Learning Perspective on Internet Congestion Control." International Conference on Machine Learning, 2019.
Background
The complexity of CC:
- Different connections select rates in an uncoordinated, decentralized manner.
- There is no explicit information about competing flows (number, times of entry/exit, type of CC algorithm) or about the network state (bandwidth, delay, loss rate, buffer size, packet-queueing policy).
- Multiple flows share the link.
- Network conditions vary over time.
Motivation
A CC algorithm can be viewed as a mapping from a locally perceived history of feedback to the next choice of sending rate.
Hypothesis: this history contains information about patterns in traffic and network state that can be exploited for better rate selection by learning the mapping with an RL approach, e.g.:
- distinguishing non-congestion loss from congestion loss
- adapting to variable network conditions
Aurora: the RL-based CC design
- Actions: the change to the sending rate (question: why not output the sending rate directly?)
- States: a bounded history of network statistics
  - latency gradient
  - latency ratio (mean latency of the current MI / minimum mean latency in the connection's history)
  - sending ratio (acked/sent)
- Reward: a function of throughput, delay, loss… ultimately it depends on the requirements.
- Discount factor $\gamma$: cannot be too low (at least 0.5); 0.99 results in faster learning.
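To make the state definition concrete, here is a minimal sketch of how the per-MI statistics above could be computed and kept in a bounded history window. The dict layout, helper names, and the endpoint approximation of the latency gradient are illustrative assumptions, not the paper's code.

```python
from collections import deque

HISTORY_LEN = 2  # k = 2 monitor intervals (MIs); see Training below

def mi_features(mi, min_mean_latency):
    """Return (latency gradient, latency ratio, sending ratio) for one MI.

    `mi` is assumed to be a dict with 'latencies' (sample RTTs in seconds,
    oldest first), 'duration' (seconds), 'sent' and 'acked' packet counts.
    """
    mean_latency = sum(mi["latencies"]) / len(mi["latencies"])
    # Latency gradient: slope of latency over the MI, approximated here
    # by the endpoint difference divided by the MI duration.
    gradient = (mi["latencies"][-1] - mi["latencies"][0]) / mi["duration"]
    ratio = mean_latency / min_mean_latency
    sending_ratio = mi["acked"] / mi["sent"]
    return (gradient, ratio, sending_ratio)

class History:
    """Fixed-length window of MI feature triples fed to the policy."""
    def __init__(self, k=HISTORY_LEN):
        self.window = deque(maxlen=k)

    def push(self, features):
        self.window.append(features)

    def observation(self):
        # Flatten the k most recent feature triples into one input vector.
        return [x for feats in self.window for x in feats]
```

The `deque(maxlen=k)` makes the "bounded history" explicit: pushing a new MI silently drops the oldest one.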
An NN model (even a single layer) performs better than a linear model.
Architecture
Input & Output:
$\alpha$ is a scaling factor to dampen oscillations; the paper sets $\alpha = 0.025$.
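As I read the paper, the action $a_t$ changes the rate multiplicatively, with the cases for positive and negative actions kept symmetric; a small sketch of that update rule:

```python
def update_rate(prev_rate, action, alpha=0.025):
    """Map the policy's action a_t to the next sending rate.

    alpha dampens oscillations; a positive action scales the rate up by
    (1 + alpha * a_t), and a negative action scales it down by the
    symmetric factor 1 / (1 - alpha * a_t).
    """
    if action >= 0:
        return prev_rate * (1 + alpha * action)
    return prev_rate / (1 - alpha * action)
```

The division in the negative branch makes an action of $-a$ exactly undo an action of $+a$, so repeated up/down actions do not drift the rate.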
Neural Network:
Even a simple fully connected NN can produce good results. A network with two hidden layers of $32 \rightarrow 16$ neurons and tanh nonlinearity is selected.
Reward:
$throughput$ in packets per second, $latency$ in seconds, $loss$ is the ratio of packets sent but not acked.
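A sketch of the linear reward in those units; the coefficients (10, −1000, −2000) are the ones the paper uses in its experiments, quoted here from memory rather than taken from these notes:

```python
def reward(throughput, latency, loss):
    """Linear reward for one monitor interval.

    throughput: packets per second
    latency:    seconds
    loss:       fraction of packets sent but not acked, in [0, 1]
    """
    return 10.0 * throughput - 1000.0 * latency - 2000.0 * loss
```

The heavy penalties on latency and loss relative to throughput encode the requirement that filling the link is worthless if it bloats queues or drops packets.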
Training:
Training runs in a Gym environment, using the PPO method from OpenAI Baselines (see the PPO paper). Some parameters:
- history length: k = 2 is probably enough.
- discount factor: 0.99
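One way to see why $\gamma$ cannot be too low: a rate change only shows up in latency and loss several MIs later, so the policy's effective planning horizon, roughly $1/(1-\gamma)$ steps, must cover that feedback delay. A tiny illustration:

```python
def effective_horizon(gamma):
    """Rough planning horizon implied by discount factor gamma:
    the number of future steps before the cumulative discount weight
    becomes negligible, approximately 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)
```

With $\gamma = 0.5$ the agent effectively looks ~2 MIs ahead, barely enough to see its own feedback; $\gamma = 0.99$ gives a ~100-MI horizon.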
Evaluation
In fact, the training set consists only of simulated data; the testing suite goes far beyond it.
- Robust to bandwidth, latency, loss rate, and buffer size beyond the training range. (So what about adapting to changing network conditions?)
- Performs at the level of first-tier CC algorithms.
Problems
Fairness
When trained in an environment competing with TCP, it can learn to induce packet loss that forces TCP back into slow start, freeing up link capacity. (Amazing!)
Multiple objectives
Adaptation
When conditions go beyond the training scope:
- the current design provides some robustness
- what about falling back to another CC algorithm when a new condition is detected?
- what about continually adapting (online learning)?
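The fallback idea above could look something like the sketch below: keep a crude check of whether the observed statistics fall inside the training envelope, and hand control to a conventional CC algorithm when they do not. This is purely illustrative, not from the paper; the bounds and the observation keys are hypothetical.

```python
def in_training_scope(obs, max_latency_ratio=10.0, max_loss=0.05):
    """Crude out-of-distribution check on the observed statistics.

    The bounds are hypothetical stand-ins for the ranges the policy
    was actually trained on.
    """
    return (obs["latency_ratio"] <= max_latency_ratio
            and obs["loss"] <= max_loss)

def choose_rate(obs, aurora_rate, fallback_rate):
    """Use the learned policy's rate inside the training envelope,
    otherwise fall back to a conventional CC algorithm's rate
    (e.g. one computed by CUBIC)."""
    return aurora_rate if in_training_scope(obs) else fallback_rate
```

A real implementation would need hysteresis (so control does not flap between the two algorithms) and a policy for handing state back to Aurora once conditions normalize.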