This idea and project are based on: Jay, Nathan, et al. "A Deep Reinforcement Learning Perspective on Internet Congestion Control." International Conference on Machine Learning, 2019.
Background
The complexity of CC:
- Different connections select rates in an uncoordinated, decentralized manner.
- There is no explicit information about competing flows (number, times of entry/exit, type of CC algorithm) or about the network state (bandwidth, delay, loss rate, buffer size, packet-queueing policy).
- Multiple flows share the link.
- Network conditions vary over time.
Motivation
A CC algorithm can be viewed as a mapping from a locally perceived history of feedback to the next choice of sending rate.
Hypothesis: this history contains information about patterns in traffic and network state that can be exploited for better rate selection by learning the mapping with an RL approach, e.g.:
- distinguishing non-congestion loss from congestion loss
- adapting to variable network conditions
Aurora: the RL-based CC design
- Actions: the change to the sending rate (question: why not output the sending rate directly?)
- States: a bounded history of network statistics
  - latency gradient
  - latency ratio (mean latency of the current MI / minimum mean latency in the connection's history)
  - sending ratio (acked/sent)
- Reward: a function of throughput, delay, loss… ultimately it depends on the requirements.
- Discount factor $\gamma$: cannot be too low (at least 0.5); 0.99 results in faster learning.
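To make the state definition concrete, here is a minimal sketch of how the per-MI statistics above could be computed and kept in a bounded history window. The dict layout, helper names, and the endpoint approximation of the latency gradient are illustrative assumptions, not the paper's code.

```python
from collections import deque

HISTORY_LEN = 2  # k = 2 monitor intervals (MIs); see Training below

def mi_features(mi, min_mean_latency):
    """Return (latency gradient, latency ratio, sending ratio) for one MI.

    `mi` is assumed to be a dict with 'latencies' (sample RTTs in seconds,
    oldest first), 'duration' (seconds), 'sent' and 'acked' packet counts.
    """
    mean_latency = sum(mi["latencies"]) / len(mi["latencies"])
    # Latency gradient: slope of latency over the MI, approximated here
    # by the endpoint difference divided by the MI duration.
    gradient = (mi["latencies"][-1] - mi["latencies"][0]) / mi["duration"]
    ratio = mean_latency / min_mean_latency
    sending_ratio = mi["acked"] / mi["sent"]
    return (gradient, ratio, sending_ratio)

class History:
    """Fixed-length window of MI feature triples fed to the policy."""
    def __init__(self, k=HISTORY_LEN):
        self.window = deque(maxlen=k)

    def push(self, features):
        self.window.append(features)

    def observation(self):
        # Flatten the k most recent feature triples into one input vector.
        return [x for feats in self.window for x in feats]
```

The `deque(maxlen=k)` makes the "bounded history" explicit: pushing a new MI silently drops the oldest one.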
An NN model (even a single layer) performs better than a linear model.
Architecture
Input & Output:
$\alpha$ is a scaling factor to dampen oscillations; the paper sets $\alpha = 0.025$.
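As I read the paper, the action $a_t$ changes the rate multiplicatively, with the cases for positive and negative actions kept symmetric; a small sketch of that update rule:

```python
def update_rate(prev_rate, action, alpha=0.025):
    """Map the policy's action a_t to the next sending rate.

    alpha dampens oscillations; a positive action scales the rate up by
    (1 + alpha * a_t), and a negative action scales it down by the
    symmetric factor 1 / (1 - alpha * a_t).
    """
    if action >= 0:
        return prev_rate * (1 + alpha * action)
    return prev_rate / (1 - alpha * action)
```

The division in the negative branch makes an action of $-a$ exactly undo an action of $+a$, so repeated up/down actions do not drift the rate.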
Neural Network:
Even a simple fully connected NN can produce good results. A network with two hidden layers of $32 \rightarrow 16$ neurons and tanh nonlinearity is selected.
Reward:
$throughput$ in packets per second, $latency$ in seconds, $loss$ is the ratio of packets sent but not acked.
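A sketch of the linear reward in those units; the coefficients (10, −1000, −2000) are the ones the paper uses in its experiments, quoted here from memory rather than taken from these notes:

```python
def reward(throughput, latency, loss):
    """Linear reward for one monitor interval.

    throughput: packets per second
    latency:    seconds
    loss:       fraction of packets sent but not acked, in [0, 1]
    """
    return 10.0 * throughput - 1000.0 * latency - 2000.0 * loss
```

The heavy penalties on latency and loss relative to throughput encode the requirement that filling the link is worthless if it bloats queues or drops packets.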
Training:
Training runs in a Gym environment, using the PPO method from OpenAI Baselines (see the PPO paper). Some parameters:
- history length: k = 2 is probably enough.
- discount factor: 0.99
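One way to see why $\gamma$ cannot be too low: a rate change only shows up in latency and loss several MIs later, so the policy's effective planning horizon, roughly $1/(1-\gamma)$ steps, must cover that feedback delay. A tiny illustration:

```python
def effective_horizon(gamma):
    """Rough planning horizon implied by discount factor gamma:
    the number of future steps before the cumulative discount weight
    becomes negligible, approximately 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)
```

With $\gamma = 0.5$ the agent effectively looks ~2 MIs ahead, barely enough to see its own feedback; $\gamma = 0.99$ gives a ~100-MI horizon.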
Evaluation
In fact, the training set consists only of simulated data; the testing suite goes far beyond it.
- Robust to bandwidth, latency, loss rate, and buffer size beyond the training range. (So what about adapting to changing network conditions?)
- Performs at the level of first-tier CC algorithms.
Problems
Fairness
When trained in an environment competing with TCP, it can learn to induce packet loss that forces TCP back into slow start, freeing up link capacity. (Amazing!)
Multiple objectives
Adaptation
When conditions go beyond the training scope:
- the current design provides some robustness
- what about falling back to another CC algorithm when a new condition is detected?
- what about continually adapting (online learning)?
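The fallback idea above could look something like the sketch below: keep a crude check of whether the observed statistics fall inside the training envelope, and hand control to a conventional CC algorithm when they do not. This is purely illustrative, not from the paper; the bounds and the observation keys are hypothetical.

```python
def in_training_scope(obs, max_latency_ratio=10.0, max_loss=0.05):
    """Crude out-of-distribution check on the observed statistics.

    The bounds are hypothetical stand-ins for the ranges the policy
    was actually trained on.
    """
    return (obs["latency_ratio"] <= max_latency_ratio
            and obs["loss"] <= max_loss)

def choose_rate(obs, aurora_rate, fallback_rate):
    """Use the learned policy's rate inside the training envelope,
    otherwise fall back to a conventional CC algorithm's rate
    (e.g. one computed by CUBIC)."""
    return aurora_rate if in_training_scope(obs) else fallback_rate
```

A real implementation would need hysteresis (so control does not flap between the two algorithms) and a policy for handing state back to Aurora once conditions normalize.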