Hi, it’s as the title says.
My environment is pretty fast and light, but training is another story: I run trainer.train() once and it takes about 5 minutes and barely makes it; when I run it again, the whole thing crashes.
I tested the model in a supervised training loop and it didn't leak memory. It's not the environment either, because for that test I was generating data directly from it.
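(For reference, the sanity check looked roughly like this; the model and sizes here are illustrative placeholders, not my actual network:)

```python
import os
import psutil
import torch
import torch.nn as nn

# Placeholder model; my real model is different but behaved the same way
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
proc = psutil.Process(os.getpid())

for step in range(1000):
    x = torch.randn(256, 32)              # fake batch standing in for env data
    y = torch.randint(0, 2, (256,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        # RSS stays flat across steps, so the model itself isn't leaking
        print(f"step {step}: rss={proc.memory_info().rss / 2**20:.0f} MB")
```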
I tested the same code with a dummy model and it still crashed, so I'm convinced it's something in RLlib's training loop.
I wonder if there's a way to debug memory usage per object. ray memory shows 0 B in use, I assume because I'm running in local mode, and the Ray dashboard is unusable.
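The closest thing I've found to per-object debugging outside of ray memory is the standard-library tracemalloc module; something like this (a sketch only, assuming a trainer built as in the script below) should at least show which allocations grow between two train() calls:

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

trainer.train()                        # warm-up iteration
snap_before = tracemalloc.take_snapshot()

trainer.train()                        # the iteration where memory balloons
snap_after = tracemalloc.take_snapshot()

# Print the code locations whose allocations grew the most between iterations
for stat in snap_after.compare_to(snap_before, "lineno")[:20]:
    print(stat)
```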
Here is the code, with no local dependencies; it should reproduce the issue.
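(As a minimal sketch of what the script boils down to, with CartPole standing in for my custom env:)

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init(local_mode=True)

config = {
    "env": "CartPole-v0",    # stand-in for my custom env
    "framework": "torch",
    "num_workers": 0,
}
trainer = PPOTrainer(config=config)

# The first call completes (slowly); the second one runs the machine out of memory
for i in range(5):
    result = trainer.train()
    print(i, result["episode_reward_mean"])
```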
Here’s the error trace when it finally crashes:
Finally, RLlib doesn't seem to detect my GPU, even though cuda.is_available() == True.
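In case it matters, my understanding is that GPUs have to be requested explicitly through the config; the snippet below is just my assumption of where that happens (num_gpus on the trainer config), so please correct me if I'm looking in the wrong place:

```python
import torch
print(torch.cuda.is_available())   # prints True on my machine

config = {
    "env": "CartPole-v0",
    "framework": "torch",
    "num_gpus": 1,               # GPU for the driver / learner process
    "num_gpus_per_worker": 0,    # rollout workers stay on CPU
}
```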