The RLlib team is discussing removing some algorithms from the library so that we can focus on improving the quality of the code base as a whole. Doing so would reduce our maintenance overhead.
The algorithms that we’re discussing removing as of now are:
simple_q (replaced by DQN)
ddpg (replaced by SAC)
A2C, A3C (replaced by PPO)
MAML and MBMPO (Not enough meta-RL users and we haven’t tested these in a while)
Please share your thoughts, thanks!
Avnish and The RLlib Team
I suggest not removing these algorithms. They are often used as baselines in published articles, so keeping them lets us compare our results against those baselines.
To be clear, we’re only thinking about removing these from future releases. Users should still be able to use these algorithms in older releases of RLlib.
That being said, is there a reason why you would choose an algorithm like ddpg instead of SAC, or A2C instead of A3C, when using RLlib for your use cases? The state of the art in these online RL algorithms has not changed in a while.
I tend to see different performance between PPO and A2C. I'm not sure whether that is just a difference in algorithm parameters. I use A2C the most, and would appreciate continued support for it (or for A3C).
@avnishn I often test different algorithms because I take a “scientific approach” and want to understand how the hyperparameters of different algorithms influence their effectiveness.
Removing some algorithms from future releases will force Ray users to switch between releases, and as the API keeps changing, that switching will become very inconvenient.
I am currently using MAML and MBMPO to reduce the sim-to-real gap using meta-learning for a real-world deployment project of Traffic Signal Control. If anything, I would welcome more meta-learning algorithms and domain randomization techniques going forward, as they are of utmost importance when deploying RL to real-world scenarios.
For the remaining algorithms, it makes sense to remove them in future releases since better alternatives are available now.
In some domains, the synchronous vs. async implementations can make a huge difference in stability. There are many ways to increase stability of async training, but in my application, that usually means slower training. It is helpful to have the ‘sample_async’ switch. The microbatch averaging in A2C helps avoid overestimation bias when I run a minimum number of environments. It is difficult (or impossible) to find a stable/convergent config with the async algs in a minimum data setting where overfitting is likely.
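For readers unfamiliar with microbatch averaging, here is a minimal sketch of the mechanics only. It is not RLlib's implementation: the toy loss and the helper names `mse_grad` and `averaged_microbatch_grad` are hypothetical, and I believe the corresponding A2C setting in RLlib is `microbatch_size`, but please check the docs for your release.

```python
# Toy illustration (not RLlib code): gradients are computed per microbatch
# and averaged before a single update is applied.

def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def averaged_microbatch_grad(w, batch, microbatch_size):
    """Split the batch into microbatches and average their gradients."""
    chunks = [
        batch[i:i + microbatch_size]
        for i in range(0, len(batch), microbatch_size)
    ]
    grads = [mse_grad(w, chunk) for chunk in chunks]
    return sum(grads) / len(grads)

# Hypothetical data: y = 2 * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0

full_batch_grad = mse_grad(w, data)
microbatch_grad = averaged_microbatch_grad(w, data, microbatch_size=2)
```

With equally sized microbatches, the averaged gradient matches the full-batch gradient, so the update is the same while each forward/backward pass only touches a small chunk of data.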
Following up on what @rusu24edward and @ekblad said, I also still use A2C often in my environments.
I have found that A2C is much less data efficient, in that it takes more sampled steps to achieve the same reward, but it is much more computationally efficient.
Even though it may take 10,000 more sampled steps for A2C to achieve the same reward as PPO, in terms of wall time it can sample them in half the time that PPO can, so it actually finishes more quickly in terms of the real time I am paying for in cloud compute.
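To make the trade-off concrete, here is a rough back-of-the-envelope sketch. All numbers are hypothetical and chosen only to mirror the "10,000 more steps, half the time" situation described above; they are not measurements.

```python
def wall_clock_seconds(steps_to_target: int, steps_per_second: float) -> float:
    """Wall-clock time needed to collect and train on the given number of steps."""
    return steps_to_target / steps_per_second

# Hypothetical numbers, for illustration only.
ppo_steps = 100_000              # sampled steps PPO needs to reach the target reward
a2c_steps = ppo_steps + 10_000   # A2C needs 10,000 more sampled steps

ppo_throughput = 500.0           # steps per second, assumed
a2c_throughput = 1_100.0         # steps per second, assumed

ppo_time = wall_clock_seconds(ppo_steps, ppo_throughput)
a2c_time = wall_clock_seconds(a2c_steps, a2c_throughput)

print(f"PPO: {ppo_time:.0f} s, A2C: {a2c_time:.0f} s")
```

Under these assumed throughputs, A2C reaches the target in half the wall-clock time despite being less sample efficient, which is what matters when paying for compute by the hour.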
I have also found it to be more stable than PPO for networks with memory. I am much more likely to get exploding / vanishing gradients that lead to NaNs with PPO than with A2C (where it virtually never happens). Part of this is that A2C has gradient norm clipping on by default in RLlib and PPO does not, but once I turn on gradient norm clipping for PPO, its data-efficiency advantage disappears: it no longer learns faster in terms of sampled steps. Practically, for me this means that A2C performs comparably to PPO and trains faster.
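For anyone unsure what "grad norm" refers to here, this is a minimal sketch of clipping by global norm, written in plain Python rather than with any RLlib or framework call. The helper name `clip_by_global_norm` and the numbers are hypothetical; in RLlib I believe this is controlled by the `grad_clip` config setting, but verify that against the docs for your release.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so that their combined L2 norm is at most max_norm.

    grads is a list of lists of floats (one inner list per parameter tensor).
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm or global_norm == 0.0:
        # Already within the limit: return an unmodified copy.
        return [list(vec) for vec in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm

# Hypothetical gradients with a global norm of 13.
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=6.5)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Clipping keeps the direction of the update and only shrinks its magnitude, which is why it guards against exploding gradients but can also slow learning when the limit is set low.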
With respect to A2C vs A3C, A2C was developed after A3C. Here is what OpenAI says about it:
Our synchronous A2C implementation performs better than our asynchronous implementations — we have not seen any evidence that the noise introduced by asynchrony provides any performance benefit. This A2C implementation is more cost-effective than A3C when using single-GPU machines, and is faster than a CPU-only A3C implementation when using larger policies.
Given that you already have and are planning on keeping IMPALA, my suggestion if you decide you cannot keep both A2C and A3C would be to keep A2C and IMPALA.
Even though it was not on the list, I would suggest removing MADDPG from the supported algorithms.
Its performance is often much worse than that of recent MARL algorithms. In fact, for some benchmarks such as SMAC, recent papers no longer include it because it is not competitive.