How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
Hi, the problem I’m trying to solve with RLlib is a POMDP, and to succeed I need good generalization. I’ve been reading various articles and papers on best practices here. I’m quite new to RL.
I found a few interesting articles / papers:
This one proposes the following approach, which seems to outperform baselines on several problems:
- Separating the RNNs in actor and critic networks. Un-sharing the weights can prevent gradient explosion, and can be the difference between the algorithm learning nothing and solving the task almost perfectly.
- Using an off-policy RL algorithm to improve sample efficiency. Using, say, TD3 instead of PPO greatly improves sample efficiency.
- Tuning the RNN context length. We found that the RNN architectures (LSTM and GRU) do not matter much, but the RNN context length (the length of the sequence fed into the RL algorithm), is crucial and depends on the task. We suggest choosing a medium length as a start.
Points 2 and 3 are easy to do in Ray. UPDATE: not so easy after all; I see TD3 and SAC don’t support the LSTM auto-wrap. Would this be trivial to implement myself, or is there a good reason it’s currently missing from Ray (i.e. it’s hard to do)?
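For point 3, this is roughly what I have in mind with PPO (just a sketch; the key names are RLlib’s standard model config options, the values are placeholders I’d tune). As far as I understand, `vf_share_layers=False` only un-shares the feedforward trunk, while the auto-wrapped LSTM itself is still shared between actor and critic, which is exactly why I’m asking question 1 below:

```python
# Sketch of an RLlib model config for PPO with the LSTM auto-wrapper.
# Key names are RLlib model config options; the values are illustrative only.
model_config = {
    "use_lstm": True,       # wrap the default FC net with an LSTM
    "max_seq_len": 20,      # RNN context length fed into the loss -- tune per task
    "lstm_cell_size": 256,  # LSTM hidden size
    # Un-shares the feedforward layers between policy and value branches,
    # but (as I understand it) NOT the auto-wrapped LSTM itself.
    "vf_share_layers": False,
}
```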
But how would one go about point 1, separating the RNNs in the actor and critic networks? Is this something you can do in RLlib?
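To make concrete what I mean by “separate RNNs”: something like the standalone PyTorch sketch below, where the actor and critic each own an independent LSTM with no shared weights. (This is a hypothetical illustration, not RLlib code; in RLlib I assume this logic would have to live inside a custom recurrent model class rather than a bare `nn.Module`.)

```python
import torch
import torch.nn as nn

class SeparateRNNActorCritic(nn.Module):
    """Actor and critic each own an independent LSTM (no weight sharing).

    Hypothetical sketch of the architecture from point 1; not an RLlib model.
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        # Actor branch: its own recurrent core plus a policy head.
        self.actor_rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.actor_head = nn.Linear(hidden, act_dim)
        # Critic branch: a *separate* recurrent core plus a value head.
        self.critic_rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.critic_head = nn.Linear(hidden, 1)

    def forward(self, obs_seq: torch.Tensor):
        # obs_seq: [batch, time, obs_dim]
        a_out, _ = self.actor_rnn(obs_seq)
        c_out, _ = self.critic_rnn(obs_seq)
        logits = self.actor_head(a_out)               # [batch, time, act_dim]
        values = self.critic_head(c_out).squeeze(-1)  # [batch, time]
        return logits, values

model = SeparateRNNActorCritic(obs_dim=8, act_dim=4)
logits, values = model(torch.randn(2, 10, 8))
```

Gradients from the value loss then never touch the actor’s LSTM, which is the “un-sharing” the paper credits with preventing gradient explosion.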
There is also this article, which introduces an algorithm called ‘LEEP’ for this use case. It seems interesting, but unfortunately it’s not available in RLlib.