I want to try something out: I want to “Jump-Start” my agent as described here: [2204.02372] Jump-Start Reinforcement Learning
The way they go about it is that, some of the time, they inject actions from a source other than the model. So you would store the injected action and also send it to the environment. As time goes on, you let the model take more control.
How would I do this? I was trying to find where in the code the actions passed to step() are appended for training, since I figured that would be the best spot, but I can’t seem to find it.
I am currently using PPOConfig(); I’m not sure if this matters.
This is how I would recommend designing jump-start in RLlib.
I would create a custom environment that wraps the underlying environment and has the known transitions for the guide policy. It also keeps track of which h is being used from H and when it should switch from the guide actions to the exploration actions. While it is in the guide regime, it places the guide action in the info dictionary. You could also always place the guide action in the info dictionary and add a second entry indicating whether the policy should use it; this would let you define custom metrics comparing the known action and the exploration action.
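A minimal sketch of such a wrapper, assuming the gymnasium API; `guide_policy` (a callable mapping observations to guide actions), `set_guide_horizon`, and the info keys are names I made up for illustration:

```python
import gymnasium as gym


class JumpStartWrapper(gym.Wrapper):
    """Wraps the real env and exposes guide actions through the info dict."""

    def __init__(self, env, guide_policy, h=50):
        super().__init__(env)
        self.guide_policy = guide_policy  # hypothetical callable: obs -> guide action
        self.h = h                        # current guide horizon, shrunk over training
        self._t = 0                       # step counter within the episode

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._t = 0
        info["guide_action"] = self.guide_policy(obs)
        info["use_guide"] = self._t < self.h
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._t += 1
        # Always expose the guide action; a second flag says whether the
        # policy should actually follow it (handy for custom metrics).
        info["guide_action"] = self.guide_policy(obs)
        info["use_guide"] = self._t < self.h
        return obs, reward, terminated, truncated, info

    def set_guide_horizon(self, h):
        """Called from a callback to shrink the guide regime over time."""
        self.h = h
```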
Create a custom policy that either produces the guide action or uses the neural network policy, depending on what is in the info dictionary. The policy does not return actions directly but logits for each action. For a categorical action you could return the one-hot encoding of the action, and for a box action you could return a mean of the desired value and a std of 0.0001.
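A rough sketch of that logits trick, assuming PyTorch and gymnasium spaces; the helper name and the log-std constant are mine. You would call something like this from your custom model’s forward() (or a policy override) when the guide flag is set; note that inside a model you would need the guide action reachable through the observation (e.g. a Dict observation space), since the info dict does not flow into the model directly.

```python
import gymnasium as gym
import numpy as np
import torch


def guide_action_to_dist_inputs(guide_action, action_space, log_std=-9.0):
    """Turn a known guide action into distribution inputs ("logits") that make
    sampling from the action distribution (almost) deterministic.

    Hypothetical helper: Discrete -> heavily scaled one-hot for a Categorical;
    Box -> [mean, log_std] with a tiny std (exp(-9) ~ 1e-4), matching the
    mean/log-std layout of a diagonal Gaussian distribution.
    """
    if isinstance(action_space, gym.spaces.Discrete):
        logits = torch.full((action_space.n,), -50.0)
        logits[int(guide_action)] = 50.0  # ~all probability mass on the guide action
        return logits
    if isinstance(action_space, gym.spaces.Box):
        mean = torch.as_tensor(np.asarray(guide_action), dtype=torch.float32).flatten()
        log_stds = torch.full_like(mean, log_std)
        return torch.cat([mean, log_stds])
    raise NotImplementedError(f"Unsupported action space: {action_space}")
```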
Create an RLlib callback (note: not a Tune callback; they are different) and use on_evaluate_end to tell the environments to adjust h based on the evaluation results. This will probably be the trickiest part to figure out, because you will have to use the algorithm’s data structures and APIs to find and modify the environments on the workers. I usually have to do this in an interactive debug session.
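A sketch of that callback on the old API stack; the metric-key handling, reward threshold, and shrink rule are assumptions, and it relies on the hypothetical `set_guide_horizon` from the wrapper sketch above. You would register it with `PPOConfig().callbacks(JumpStartCallback)`.

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class JumpStartCallback(DefaultCallbacks):
    """Shrinks the guide horizon h on every env copy after an evaluation round."""

    def on_evaluate_end(self, *, algorithm, evaluation_metrics, **kwargs):
        # The exact key layout of evaluation_metrics depends on the RLlib
        # version, so look in both places.
        results = evaluation_metrics.get("evaluation", evaluation_metrics)
        mean_reward = results.get("episode_reward_mean", float("-inf"))
        if mean_reward < 100.0:  # hypothetical threshold; tune for your task
            return

        def shrink_h(env):
            # env is each (wrapped) env instance living on a rollout worker.
            if hasattr(env, "set_guide_horizon"):
                env.set_guide_horizon(max(0, int(env.h * 0.9)))

        # Reach into the rollout workers and update every env copy.
        algorithm.workers.foreach_worker(
            lambda worker: worker.foreach_env(shrink_h)
        )
```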
Hopefully this helps. I am happy to follow up if something doesn’t make sense or you get stuck.
Note: you could probably also use observation and action connectors for the environment wrapper and custom policy steps above, but I don’t use them, so I am not sure how I would recommend doing it with those.