Memory management with non-exclusive node access

Hi everyone,

I am running simple RLlib runs using Tune on a cluster that uses the MOAB workload manager (similar to SLURM).

The jobs themselves are really simple, e.g. evaluating DQN on a custom variation of Breakout, so they should not need huge amounts of RAM.

Thus, when I schedule the job on the cluster, I might ask for something like 1 GPU, 2 CPUs, and 16 GB of RAM per core.

However, Ray seems to think it has exclusive access to the node:

InitialConfigResources: {GPU: 1.000000}, {CPU: 128.000000}, {memory: 339.085533 GiB}, {object_store_memory: 149.313754 GiB}, {node: 1.000000}, {accelerator_type:T4: 1.000000}

With DQN specifically, that leads to the good old "The actor died unexpectedly" error. I assume that DQN starts to use too much memory and is thus killed by MOAB. (Interestingly, PPO, DDPG, and A2C work fine.)

I tried to control the problem by specifying the resources Ray is supposed to use.
For testing purposes, I tried to run with num_cpus=2, memory=2e8, object_store_memory=4e8 locally. I picked those small values just to see whether this works at all.

In the Dashboard, I find two workers. One of them is idle; the other one uses this much memory:

Both RSS and data grew even further over training.

How does this relate to the memory settings I passed to Ray? The shared memory seems to be on the order of the specified object_store_memory, but overall the worker uses much more memory than it is supposed to (and I don't assume that Redis and the raylet both need that much memory).

So my questions are:

  1. What exactly does the _memory parameter of ray.init() control?
  2. How can I ensure that Ray only uses the resources it is allowed to use on a shared node?

Thanks in advance!

Have a great day,

Currently, the memory resource specification is only used for bookkeeping; it doesn't enforce the memory usage of each worker. That is, imagine you have 4 GB of memory for Ray and create 2 actors with 2 GB of memory each. Ray will schedule both actors on that node, but if one of the actors uses 4 GB of memory, it can still crash the node.

That said, to resolve the issue you should reduce the RSS consumption of the worker (which is likely application-specific in this case).

Another tip from the RLlib team: reduce the replay buffer size of your DQN agent so that it consumes less memory, in case a hard limit is not feasible.
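For example, something like the following (the "buffer_size" key with its 50k default matches older RLlib versions; newer ones nest this under "replay_buffer_config", so check your version; the per-transition size is a rough assumption, not a measured number):

```python
# Hypothetical sketch: shrink DQN's replay buffer via the RLlib config dict.
dqn_config = {
    "buffer_size": 5_000,  # 10x smaller than the old 50_000 default
}

# Back-of-the-envelope RSS estimate, assuming ~56 KB per stored Breakout
# transition (4 stacked 84x84 uint8 frames for obs and next_obs, plus a
# little per-entry overhead).
bytes_per_transition = 2 * (4 * 84 * 84) + 64
estimated_mb = dqn_config["buffer_size"] * bytes_per_transition / 1e6
print(f"~{estimated_mb:.0f} MB for the replay buffer")  # vs ~2.8 GB at 50k
```

Under these assumptions the buffer alone accounts for gigabytes at the default size, which would explain why DQN hits limits that PPO/A2C (which keep no large replay buffer) do not.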

Thank you two for the clarifications.

I haven’t specifically set DQN’s buffer size, so it should be at the default of 50k. That’s also why I was wondering why this problem occurs at all.