How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to use PPO to train an agent in the following setting: I have a custom gym-compatible environment that gives me raw sensor data. I need to encode these large observations into compact features using a pretrained, frozen neural network. These features then become the observations that I feed into my policy.
My issue is that I cannot figure out how to parallelize this setup. I want my gym-compatible environments running in parallel, so I am vectorizing them. If I also make the feature-extractor network part of the environment, it gets duplicated across every environment instance and wastes GPU memory. Ideally, I would like to vectorize my gym environment, collect the samples from every environment instance, batch them, and extract the features from the whole batch with a single forward pass through the frozen network.
So far I’m unable to find a way to do this in Ray. Things I have considered:
- Make the frozen encoder part of the policy. This is problematic because I do not want to store the huge sensor data inside the replay buffer; it would fill my RAM very quickly. It would also force me to re-encode the same samples for every forward pass during training.
- Vectorize the gym environment using RLlib’s functions, then put a wrapper on it that operates on the returned batch of samples. This feels ideal, but I couldn’t figure out a way to do it without hacky things like accessing Ray’s internal vectorization methods.
Any guidance on how to proceed would be much appreciated.
Hi @egeonat ,
I solved this problem almost 2 years ago with the execution plan API by having a second learner thread. This does not work anymore.
For what you are trying to do, that takes quite a bit of engineering today.
If you can’t wait 1-2 releases for some changes we are working on that will make this a lot easier, you can try the following:
- Collect samples with some PPO Policy and modify PPO.training_step() to save these experiences to disk.
- Create a PPO Policy with a custom model that looks the way you want it to look.
- Extract the encoder part of that model (you can instantiate the Policy and grab it from there).
- Train the model “manually” on the experiences you saved to disk (RLlib does not offer supervised training routines).
- Checkpoint the Policy with the trained encoder.
- Freeze the encoder layers in your custom model.
- Train with RLlib as usual from this checkpoint.
(To better fit the observation distribution, you can periodically retrain the encoder on experiences you collect later, following the same procedure.)
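The freezing step above can be sketched in plain PyTorch. This is a minimal illustration, not RLlib code: `encoder` here is a hypothetical stand-in for the pretrained sensor encoder, and in a real setup this logic would live inside your custom model class.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained sensor encoder; in practice you
# would load your own checkpoint instead.
encoder = nn.Sequential(nn.Linear(128, 32), nn.ReLU())

# Freeze the encoder: no gradients, and eval() so layers like BatchNorm or
# Dropout behave deterministically during rollouts and training.
encoder.requires_grad_(False)
encoder.eval()

# The policy head stays trainable.
policy_head = nn.Linear(32, 4)

obs = torch.randn(8, 128)       # batch of raw sensor observations
with torch.no_grad():           # no graph is built through the encoder
    features = encoder(obs)
logits = policy_head(features)  # gradients flow only through the head
```

Because `requires_grad_(False)` is set on every encoder parameter, the optimizer will only ever update the head, even if you pass all parameters to it.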
Thanks for your response. If I am understanding your suggestion correctly: once I have the feature extractor trained, you are suggesting that I make it part of my policy model, freeze its weights, and then train as normal.
I already have my feature extractor trained, so that part is handled. My concern with making it part of my policy is that when PPO iterates over the same samples multiple times during updates, I will be extracting the same features over and over again. This is problematic because feature extraction is quite slow in my setup.
Do you think it would be possible to create a custom environment class that subclasses VectorEnv and overrides the vector_step method, so that I feed the collected observations through my feature extractor before returning them?
Yes, you can do this. It will obviously make sampling slower, but for PPO you can simply increase the number of workers, and it won’t affect learning.
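A minimal sketch of that idea, without the RLlib dependency: `BatchEncodedVectorEnv` and `DummySensorEnv` below are hypothetical names, and the class is a stand-in that mimics `vector_step`'s return shape (lists of observations, rewards, dones, infos) rather than RLlib's actual `VectorEnv` base class. In a real setup you would subclass `ray.rllib.env.VectorEnv` and override `vector_step` with the same batching trick.

```python
import torch
import torch.nn as nn

class DummySensorEnv:
    """Toy stand-in env that emits raw 128-dim 'sensor' observations."""
    def step(self, action):
        return torch.randn(128), 0.0, False, {}

class BatchEncodedVectorEnv:
    """Sketch of the batching idea: step every sub-env, then run ONE
    forward pass through the frozen encoder and return the resulting
    features as the observations."""

    def __init__(self, envs, encoder):
        self.envs = envs
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)

    def vector_step(self, actions):
        raw_obs, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, rew, done, info = env.step(action)
            raw_obs.append(obs)
            rewards.append(rew)
            dones.append(done)
            infos.append(info)
        # Single batched forward pass over all sub-env observations.
        with torch.no_grad():
            feats = self.encoder(torch.stack(raw_obs))
        return list(feats), rewards, dones, infos

envs = [DummySensorEnv() for _ in range(4)]
venv = BatchEncodedVectorEnv(envs, nn.Linear(128, 32))
feats, rewards, dones, infos = venv.vector_step([0, 0, 0, 0])
```

The policy now only ever sees the compact features, so the replay/sample batches stay small, and each observation is encoded exactly once at sampling time rather than on every PPO epoch.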