Ray version: 2.40.0
Python: 3.11
OS: WSL2, Ubuntu 24.04.1
Hello everyone,
I'm trying to implement a custom RLModule to use with PPO. I managed to put together a class that subclasses TorchRLModule and ValueFunctionAPI, and it has working setup, _forward_exploration, _forward_inference, _forward_train, and compute_values methods.
Now, my approach is to first replicate the default PPO RLModule and then continue customizing from there, but the custom and default RLModules seem to learn differently.
What I observe is that the custom RLModule converges faster, and to a much lower level of reward, than the default module. How would I write a custom module that can be plugged into PPOConfig.rl_module() and that would behave exactly like the default module?
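For concreteness, here is a stripped-down sketch of the kind of module I mean (not my exact code): a shared MLP encoder with separate policy and value heads, assuming a flat observation space and a discrete action space. The "hidden_dim" model-config key is just a placeholder of mine, and the import paths are what I believe is right for Ray 2.40:

```python
import torch

from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.apis.value_function_api import ValueFunctionAPI
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule


class MyPPOTorchRLModule(TorchRLModule, ValueFunctionAPI):
    """Shared MLP encoder with separate policy and value heads."""

    def setup(self):
        # "hidden_dim" is a made-up model_config key, not an RLlib default.
        hidden_dim = (self.model_config or {}).get("hidden_dim", 256)
        obs_dim = self.observation_space.shape[0]
        num_actions = self.action_space.n

        self._encoder = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
        )
        self._pi_head = torch.nn.Linear(hidden_dim, num_actions)
        self._vf_head = torch.nn.Linear(hidden_dim, 1)

    def _forward_inference(self, batch, **kwargs):
        return self._common_forward(batch)

    def _forward_exploration(self, batch, **kwargs):
        return self._common_forward(batch)

    def _forward_train(self, batch, **kwargs):
        return self._common_forward(batch)

    def compute_values(self, batch, embeddings=None):
        # The learner may pass pre-computed embeddings; otherwise re-run the encoder.
        if embeddings is None:
            embeddings = self._encoder(batch[Columns.OBS])
        return self._vf_head(embeddings).squeeze(-1)

    def _common_forward(self, batch):
        # Return raw policy logits; PPO builds the action distribution from these.
        embeddings = self._encoder(batch[Columns.OBS])
        return {Columns.ACTION_DIST_INPUTS: self._pi_head(embeddings)}
```

and I plug it into the config roughly like this (again just a sketch, e.g. on CartPole-v1):

```python
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rl_module(rl_module_spec=RLModuleSpec(module_class=MyPPOTorchRLModule))
)
```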
Hey @termpu,
I have also been running into this problem, specifically with PPO. Have you been able to track it down? I have created a few examples / PRs on the RLlib GitHub, but I am finding that I cannot get the same performance from my custom models that I could from the old stack, and the custom module mostly performs worse than the default RLModule.
Thanks,
Tyler
Hi @tlaurie99!
No luck so far! I tried to implement a custom ActorCriticEncoder and a CustomPPOCatalog in addition to the actual RLModule, but I didn't even manage to get that to train. I didn't put much effort into it; the code is mostly LLM-made. Another thing I tried was upgrading to Ray 2.44, but that didn't help either. Based on a small number of tests, I got the feeling that the default PPO on 2.44 was doing worse than on 2.40. I don't know if that is actually true, but I reverted to 2.40 afterwards.
I kind of have an offline problem, but it has online characteristics, so what I'm actually trying to do is add some dropout layers inside the PPO neural network.
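Concretely, the change I'm after would just be inserting torch.nn.Dropout layers into the encoder stack of a module like the sketch above, for example (plain-torch sketch; layer sizes and the dropout probability are arbitrary):

```python
import torch


def mlp_encoder_with_dropout(in_dim: int, hidden_dim: int, p: float = 0.1) -> torch.nn.Sequential:
    # Same two-layer MLP encoder as in the sketch above, with dropout after each activation.
    return torch.nn.Sequential(
        torch.nn.Linear(in_dim, hidden_dim),
        torch.nn.Tanh(),
        torch.nn.Dropout(p=p),
        torch.nn.Linear(hidden_dim, hidden_dim),
        torch.nn.Tanh(),
        torch.nn.Dropout(p=p),
    )
```

One thing I know I'd have to watch out for is that dropout is only active when the module is in train mode, so rollouts and learner passes would effectively see different networks unless the train/eval mode switching is handled deliberately.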
I've decided to put this aside for a while. I'll stick to tuning the default PPO and its config.