Hi, it’s as the title says.
My environment is pretty fast and lightweight, but training is another story: I run
trainer.train() once, it takes about 5 minutes and barely completes; when I run it a second time, the whole thing crashes.
I tested the model in a plain supervised training loop and it didn’t leak memory. It isn’t the environment either, because I was generating data directly from it without issue.
I tested the same code with a dummy model and it still crashed, so I’m convinced it’s something about RLlib’s training loop.
I wonder if there’s a way to debug memory usage per object.
`ray memory` shows 0 B in use (I assume because I’m running in local mode), and the Ray dashboard is unusable.
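Not RLlib-specific, but in the meantime the stdlib `gc` module can at least show which object types are growing between `trainer.train()` calls. A minimal sketch (the leak here is simulated with plain lists; in the real case you’d diff counts taken before and after each training iteration):

```python
import gc
from collections import Counter

def top_live_objects(n=10):
    """Return the n most common live object types tracked by the GC."""
    gc.collect()  # drop objects only kept alive by reference cycles
    return Counter(type(o).__name__ for o in gc.get_objects()).most_common(n)

# Take a baseline, do the suspect work, then diff the counts.
baseline = dict(top_live_objects(1000))
leak = [[i] for i in range(5000)]  # simulated leak; replace with trainer.train()
after = dict(top_live_objects(1000))
growth = {k: after[k] - baseline.get(k, 0) for k in after}
print(sorted(growth.items(), key=lambda kv: -kv[1])[:5])
```

One caveat: `gc.get_objects()` only sees GC-tracked containers (lists, dicts, instances), not raw buffers like bytearrays or NumPy array data, so it narrows down *what* is accumulating rather than accounting for every byte.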
Here is the code, with no local dependencies; it should reproduce the issue:
Here’s the error trace when it finally crashes:
Finally, RLlib doesn’t seem to detect my GPU, even though `cuda.is_available()` returns True.
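As far as I can tell, RLlib doesn’t pick up a GPU automatically just because CUDA is available; it has to be requested explicitly through the resource config. A minimal sketch of what I mean (the `PPOTrainer` import path is from the older `ray.rllib.agents` API and may differ by version):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer  # path varies across RLlib versions

ray.init(num_gpus=1)  # make the GPU visible to Ray itself
trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        "num_gpus": 1,     # GPUs reserved for the learner/driver process
        "num_workers": 1,  # rollout workers (CPU-only here)
    },
)
```

If `num_gpus` is left at its default of 0, RLlib will train on CPU even when `cuda.is_available()` is True, which may be what’s happening here.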