How severely does this issue affect your experience of using Ray?
Low: It annoys or frustrates me for a moment.
Question
Hi, I am new to RLlib and so far I find the API a bit confusing.
As I understand from the docs, trainer.train() performs a single training iteration, although during one training iteration it can use a batch with multiple steps from the environment.
How can I control which particular environment timesteps will be used during training?
More concretely, I'd like to have a system where the RL agent generates actions only for specific timesteps (and trains on them), while the other steps are handled in some other way (e.g. manually by the user).
Is there an API where I can control the environment outside of RLlib and add experience manually to the agent's buffer? Should I use offline RL for that case?
Welcome to the RLlib Discuss! This is the best place to ask questions, since they become searchable later. I'd also love to invite you to join the #rllib Slack channel. And on Tuesday we'll have RLlib Office Hours; the info is pinned here in Discuss.
In the meantime, I'll forward your question to someone who can answer it better!
As I understand from the docs, trainer.train() performs a single training iteration.
Yes.
Although during one training iteration it can use a batch with multiple steps from the environment.
Yes. Training on single samples rarely makes sense in RL, just as in the rest of ML.
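To illustrate (a minimal sketch assuming the Ray 1.x-style Trainer API; import paths and config defaults differ across Ray versions), the train_batch_size setting roughly controls how many environment timesteps are consumed by one train() call:

```python
# Minimal sketch (Ray 1.x-style API; exact import paths and config
# defaults vary by version -- treat this as illustrative only).
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

trainer = DQNTrainer(
    env="CartPole-v1",
    config={
        # Roughly: how many environment timesteps the learner consumes
        # in one training iteration.
        "train_batch_size": 32,
    },
)

result = trainer.train()            # one training iteration
print(result["timesteps_total"])    # env timesteps sampled so far
```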
How can I control which particular environment timesteps will be used during training?
Is there an API where I can control the environment outside of RLlib and add experience manually to the agent's buffer? Should I use offline RL for that case?
It sounds like you are looking to mix online and offline learning.
This is possible with our DQN derivatives. I propose that you start with plain DQN, since it's a simpler algorithm and can do what you are describing.
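As a hedged sketch of one way RLlib has supported mixing live sampling with stored experiences, the offline "input" config can take a dict of sources and proportions (the JSON path below is a placeholder, and the exact keys and supported formats depend on your Ray version):

```python
# Hedged sketch: mix fresh env samples with pre-recorded experiences via
# the offline "input" config. The path is a placeholder; check the offline
# data docs for your Ray version to confirm the exact syntax.
from ray.rllib.agents.dqn import DQNTrainer

config = {
    "input": {
        "sampler": 0.5,                       # 50% fresh env samples
        "/tmp/my_offline_data/*.json": 0.5,   # 50% previously recorded batches
    },
}

trainer = DQNTrainer(env="CartPole-v1", config=config)
```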
This is something we are exploring at the moment. Alternatively, you can write your own replay buffer, similar to the mixin replay buffer. In the next few days (see this PR), you will be able to read about it and start your work, assuming you use the nightly builds.
What you can do more easily is load a buffer upfront. You will have to go into the Trainer's setup() method, where for most algorithms the local replay buffer is initialized, and fill it with your hand-crafted experiences.
Alternatively, you can wait a couple of weeks or months, and what you want to do will likely become part of RLlib anyway.
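A rough illustration of that "fill the buffer upfront" idea (the attribute self.local_replay_buffer and its add() method are assumptions about your Ray version, so check Trainer.setup() in your installed RLlib for the actual names):

```python
# Hedged sketch: pre-fill the local replay buffer with hand-crafted
# experiences. `self.local_replay_buffer` and its `add()` method are
# assumptions -- verify the names in Trainer.setup() of your Ray version.
import numpy as np
from ray.rllib.agents.dqn import DQNTrainer
from ray.rllib.policy.sample_batch import SampleBatch


class PrefilledDQNTrainer(DQNTrainer):
    def setup(self, config):
        # Let RLlib build workers, policies, and the replay buffer first.
        super().setup(config)

        # One hand-crafted transition (CartPole-shaped observations).
        manual_batch = SampleBatch({
            SampleBatch.OBS: np.array([[0.0, 0.0, 0.0, 0.0]], dtype=np.float32),
            SampleBatch.ACTIONS: np.array([0]),
            SampleBatch.REWARDS: np.array([1.0], dtype=np.float32),
            SampleBatch.NEXT_OBS: np.array([[0.01, 0.0, 0.0, 0.0]], dtype=np.float32),
            SampleBatch.DONES: np.array([False]),
        })

        # Push the manual experience into the buffer before training starts.
        self.local_replay_buffer.add(manual_batch)
```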
I wonder if you could set this up as a multi-agent scenario. I believe the data gets divided up among the policies according to the policy mapping function, which maps agent_ids to policy_ids. So you could create a policy that will be trained and use that agent to generate the data you actually want to train on, and then make a heuristic policy for the manual actions (see the sketch below).
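A hedged sketch of what that could look like (the env class, agent ids, and spaces are placeholders; the Policy override signatures follow the older RLlib Policy API and may need adjusting for your version):

```python
# Hedged sketch of the multi-agent idea: one trainable RL policy plus one
# heuristic "manual" policy. MyMultiAgentEnv, the agent ids, and the spaces
# are placeholders you'd replace with your own.
import gym
from ray.rllib.agents.dqn import DQNTrainer
from ray.rllib.policy.policy import Policy


class ManualPolicy(Policy):
    """Heuristic policy: never trained, just returns user-defined actions."""

    def compute_actions(self, obs_batch, *args, **kwargs):
        # Replace this with whatever manual/heuristic control you need.
        return [0 for _ in obs_batch], [], {}

    def learn_on_batch(self, samples):
        return {}  # no learning for the manual policy

    def get_weights(self):
        return {}

    def set_weights(self, weights):
        pass


obs_space = gym.spaces.Box(-1.0, 1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)


def policy_mapping_fn(agent_id, *args, **kwargs):
    # Your env decides which agent id is "active" at a given timestep; only
    # steps from "rl_agent" are routed to (and trained on by) the RL policy.
    return "learned_policy" if agent_id == "rl_agent" else "manual_policy"


config = {
    "multiagent": {
        "policies": {
            "learned_policy": (None, obs_space, act_space, {}),
            "manual_policy": (ManualPolicy, obs_space, act_space, {}),
        },
        "policy_mapping_fn": policy_mapping_fn,
        # Only the RL policy gets updated from the collected data.
        "policies_to_train": ["learned_policy"],
    },
}

# Plug in your own multi-agent env here:
# trainer = DQNTrainer(env=MyMultiAgentEnv, config=config)
```

That way the RL policy only ever sees (and learns from) the timesteps your env assigns to it, while the manual policy handles the rest.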