Hi everyone,
I am running small RLlib experiments with Tune on a cluster that uses the MOAB workload manager (similar to SLURM).
The jobs themselves are really simple, e.g. evaluating DQN on a custom variation of Breakout, so they should not need huge amounts of RAM.
Thus, when I schedule the job on the cluster, I might ask for something like 1 GPU, 2 CPUs and 16 GB of RAM per core.
However, Ray seems to think it has exclusive access to the node:
NodeManager:
InitialConfigResources: {GPU: 1.000000}, {CPU: 128.000000}, {memory: 339.085533 GiB}, {object_store_memory: 149.313754 GiB}, {node:10.16.46.69: 1.000000}, {accelerator_type:T4: 1.000000}
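For reference, the same auto-detected totals also show up when querying Ray from a Python shell on the node:

```python
import ray

ray.init()  # no explicit limits -> Ray auto-detects the whole node
print(ray.cluster_resources())    # total resources Ray thinks it owns
print(ray.available_resources())  # whatever is currently unclaimed
```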
With DQN specifically, this leads to the good old "The actor died unexpectedly" error. I assume that DQN starts to use too much memory and is therefore killed by MOAB. (Interestingly, PPO, DDPG, and A2C work fine.)
I tried to get this under control by explicitly specifying the resources Ray is supposed to use.
For testing purposes, I ran with num_cpus=2, memory=2e8, object_store_memory=4e8 locally; I just picked those small values to see whether the limits take effect at all.
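Concretely, the test looks roughly like this (the environment name and most of the config are placeholders; only the resource arguments are the actual values I used):

```python
import ray
from ray import tune

# Start Ray with explicit limits instead of letting it auto-detect the node.
# The values are intentionally tiny, just to see whether the limits take effect.
ray.init(num_cpus=2, memory=2e8, object_store_memory=4e8)

# Minimal DQN run; "MyBreakout-v0" stands in for my custom Breakout variant,
# and the config is stripped down to the resource-related keys.
tune.run(
    "DQN",
    config={
        "env": "MyBreakout-v0",
        "num_workers": 1,
        "num_gpus": 0,
    },
    stop={"training_iteration": 10},
)
```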
In the dashboard, I see two workers. One of them is idle; the other one uses this much memory:
rss:1.47GB
vms:6.55GB
shared:284.73MB
text:1.84MB
lib:0KB
data:1.54GB
dirty:0KB
The RSS and data values grew even further over the course of training.
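(For context on how I read these numbers: the field names match what psutil reports for a process on Linux, so I assume the dashboard simply forwards psutil's memory_info() for the worker process.)

```python
import psutil

worker_pid = 12345  # placeholder: PID of the busy worker, taken from the dashboard
proc = psutil.Process(worker_pid)

# On Linux this returns pmem(rss, vms, shared, text, lib, data, dirty),
# i.e. exactly the fields shown in the dashboard.
print(proc.memory_info())
```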
How does this relate to the memory settings I passed to Ray? The shared memory seems to be on the order of the specified object_store_memory, but overall the worker uses much more memory than it is supposed to (and I doubt that Redis and the raylet together need that much memory).
So my questions are:
- What exactly does the _memory parameter of ray.init() control?
- How can I ensure that Ray only uses the resources it is allowed to use on a shared node? (Roughly what I mean is sketched below.)
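To make the second question a bit more concrete: on the RLlib side, I am only aware of the per-worker resource keys below, and I am not sure how they interact with the limits passed to ray.init() (keys as I understand them, possibly incomplete):

```python
# Resource-related RLlib config keys I know of; how do these relate to the
# memory / object_store_memory limits passed to ray.init()?
resource_config = {
    "num_workers": 1,          # rollout workers in addition to the driver
    "num_gpus": 1,             # GPUs used by the driver/learner
    "num_cpus_per_worker": 1,  # CPUs reserved per rollout worker
    "num_gpus_per_worker": 0,  # GPUs reserved per rollout worker
}
```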
Thanks in advance!
Have a great day,
Jan