Implementing Jump-Start Reinforcement Learning (JSRL) in RLlib

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Greetings,

I am applying the JSRL paper to a visual navigation (PointNav) task using RLlib. I recently came across RLlib and it seems like a great tool, but I am new to it, so could someone briefly explain how I would go about implementing this paper?

The paper explores fast learning using a prior policy in relatively hard-exploration / sparse-reward environments. I will describe the paper here in brief:

  1. The paper uses two policies: (i) a previously learned guide policy (a sub-optimal policy that knows what good states are), often learned from small amounts of (offline) data, and (ii) an exploration policy that learns via RL.
  2. The goal is to enable fast learning of the exploration policy given the guide policy. Because the underlying algorithms (e.g., PPO, A2C) rely on learned value estimates, naively initializing the exploration policy from the guide might not work (the paper shows experimental evidence for this).
  3. Training goes as follows: you first roll out the guide policy, and then, within the same episode, roll out the exploration policy for the remaining steps. Initially most of each episode is handled by the guide policy (e.g., the exploration policy takes over only after 90% of the timesteps are complete), and this fraction gradually decreases as the exploration policy improves over the course of training (a rough sketch of this rollout scheme is shown after this list).
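To make the scheme concrete, here is a minimal sketch of such a rollout, assuming a gymnasium env; CartPole and the two lambda policies below are only stand-ins for a PointNav env, a pretrained guide, and the RL learner:

```python
# Minimal sketch of a JSRL-style rollout. The env and both policies are
# placeholders; in the real setup the guide is a pretrained (sub-optimal)
# policy and the exploration policy is the one being trained with RL.
import gymnasium as gym


def jsrl_rollout(env, guide_policy, explore_policy, switch_step, max_steps=500):
    """Let the guide act for the first `switch_step` steps of the episode,
    then hand control to the exploration policy."""
    obs, _ = env.reset()
    episode_return = 0.0
    for t in range(max_steps):
        policy = guide_policy if t < switch_step else explore_policy
        action = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        episode_return += reward
        if terminated or truncated:
            break
    return episode_return


if __name__ == "__main__":
    env = gym.make("CartPole-v1")  # stand-in for a PointNav env
    guide_policy = lambda obs: 0                             # placeholder guide
    explore_policy = lambda obs: env.action_space.sample()   # placeholder learner

    # Curriculum: the guide's share of each episode shrinks over training.
    for switch_step in (50, 25, 10, 0):
        ret = jsrl_rollout(env, guide_policy, explore_policy, switch_step)
        print(f"switch at step {switch_step}: return {ret}")
```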

Another question related to my implementation: is there a utility in RLlib or Ray to collect and save data from an env for offline use, for the PointNav task? PointNav observations consist of the images the agent sees (possibly RGB + depth), plus GPS and compass readings.

Thank you!


I’ve been thinking about implementing this in RLlib as well, and with the way RLlib currently works it would be somewhat difficult to do.

The issue is that we don’t have a direct interface through which you can specify a guide policy or guide data.
Are you planning on using a guide policy or guide data? In either case you would also have to define multiple samplers, where the sampler mixes data from the guide policy with data from the exploration policy. You could also use mix-in replay to achieve this.
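Roughly what I mean by mixing, using dummy batches (real code would draw these from replay buffers / samplers):

```python
# Rough sketch of the "mix in guide data" idea with dummy batches: build each
# train batch partly from stored guide-policy data and partly from fresh
# exploration-policy rollouts.
from ray.rllib.policy.sample_batch import SampleBatch

guide_data = SampleBatch({"obs": [[0.0], [0.1]], "actions": [0, 0], "rewards": [1.0, 1.0]})
online_data = SampleBatch({"obs": [[1.0], [1.1]], "actions": [1, 0], "rewards": [0.0, 0.5]})

# e.g. a 50/50 mixture for one train batch
train_batch = guide_data.concat(online_data)
print(train_batch.count)  # 4 timesteps in the mixed batch
```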

Yes. I went through the tutorials over the last couple of days, and it does seem somewhat hard to implement in RLlib. Could I formulate this as a multi-agent problem, thereby allowing multiple policies (one guide policy and one exploration policy)?
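Roughly what I have in mind (a hedged sketch: the env and the policy mapping function are placeholders, and the per-episode switch from guide to exploration would still need custom logic, e.g. a callback or an env wrapper exposing the task as a MultiAgentEnv):

```python
# Hedged sketch: register two policies but only train "explore". The mapping
# function below always routes to "explore"; real JSRL would route to "guide"
# early in the episode and to "explore" for the remaining steps.
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.policy import PolicySpec

config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder env
    .multi_agent(
        policies={
            "guide": PolicySpec(),    # would be restored from pretrained weights
            "explore": PolicySpec(),  # trained from scratch
        },
        policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "explore",
        policies_to_train=["explore"],
    )
)
```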

About mixing replay buffers: I’ll need a way to properly sample the data for training in that case, for example by sampling batches filtered by the policy that generated them.
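For example, if the sampled data comes back as MultiAgentBatch objects keyed by policy, the filtering itself would just be a dictionary lookup (sketch with dummy data):

```python
# Illustrative sketch with dummy data: a MultiAgentBatch keeps per-policy
# SampleBatches, so "filter by generating policy" is a dict lookup.
from ray.rllib.policy.sample_batch import MultiAgentBatch, SampleBatch

guide_batch = SampleBatch({"obs": [[0.0]], "actions": [0], "rewards": [1.0]})
explore_batch = SampleBatch({"obs": [[1.0]], "actions": [1], "rewards": [0.5]})

ma_batch = MultiAgentBatch(
    {"guide": guide_batch, "explore": explore_batch}, env_steps=2
)

# Keep only the data produced by the exploration policy for training.
explore_only = ma_batch.policy_batches["explore"]
print(explore_only["rewards"])
```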

Hi @manjrekarom,

Not sure if this will help, but maybe you can combine these concepts to achieve Jump-Start RL:

https://docs.ray.io/en/latest/rllib/rllib-concepts.html#how-to-customize-policies

https://docs.ray.io/en/latest/rllib/rllib-offline.html
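For instance, the offline docs describe an output setting that writes sampled experiences to disk. A rough sketch (CartPole stands in for your PointNav env, the output path is a placeholder, and exact builder method names can differ between Ray versions):

```python
# Hedged sketch: write sampled experiences to disk via RLlib's offline API.
# "CartPole-v1" stands in for a PointNav env (which would need to be
# registered with tune.register_env first); the output path is a placeholder.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .offline_data(output="/tmp/pointnav-out")  # batches get written as JSON files
)

algo = config.build()
for _ in range(2):
    algo.train()  # experiences sampled during training are also written to disk
algo.stop()
```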

I am also hoping to start working on implementing JSRL in July 2022. It would be great to hear about your experiences and progress on this implementation in the meantime.

@sven1977 Is the implementation of JSRL in the pipeline for any upcoming release?

Hi @vishalrangras !

Thanks for your reply. I’ll take a look at it.

This seems similar to the DQfD/POfD algorithms. I wonder whether RLlib will provide these algorithms.

For anyone visiting this thread:
There is an open feature request for this on our GitHub, thanks to @mahuangxu.
Feel free to +1 it or share your view of why it is important, to express your need.
This way we can better track it and assess its priority.

There is no out-of-the-box, super-easy way to do this, especially in a cluster environment.
If you work locally, you can modify our training iteration functions to save batches however you like.
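For example (just a sketch, not an official API for this): a callback that dumps every sampled batch to disk via on_sample_end. The output directory is a placeholder, and on a cluster each rollout worker would write to its own local filesystem.

```python
# Hedged sketch: dump every sampled batch to disk with a callback. The output
# directory is a placeholder; on a cluster each rollout worker writes locally.
import os
import pickle

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class SaveBatchesCallback(DefaultCallbacks):
    def on_sample_end(self, *, worker, samples, **kwargs):
        out_dir = "/tmp/saved_batches"
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(
            out_dir, f"worker{worker.worker_index}_{id(samples)}.pkl"
        )
        with open(path, "wb") as f:
            pickle.dump(samples, f)


# Usage (assumed): pass the class via the algorithm config, e.g.
# config.callbacks(SaveBatchesCallback)
```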