Specifying memory requirements for RLlib algorithms in Ray Tune

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am running some experiments with DQN, which needs a lot of memory for its replay buffer, and I’d like to schedule multiple experiments on the same Ray cluster. By default, Ray only looks at the number of CPU cores requested when scheduling trials, which leads to it starting more trials than I have memory for. Is there a way to tell Ray how much memory my experiment will use?

I tried setting resources_per_trial={"cpu": 1, "memory": 12000000} in tune.run(), but I get the error:
Resources for ... have been automatically set to ... by its 'default_resource_request()' method. Please clear the 'resources_per_trial' option.
I take this to mean that you’re not supposed to override the resource request from the RLlib algorithm, but then how do you manage memory for DQN algorithms?

Hi @mgerstgrasser, the “cpu” field is merely a bookkeeping variable in Ray; it doesn’t necessarily map to physical CPUs. By setting it you define what fraction of the cluster’s resources each trial should be allocated. That said, what you can do here is limit the number of concurrent Tune trials to a count that you know won’t run out of memory. See https://docs.ray.io/en/latest/ray-air/tuner.html#how-to-specify-parallelism for how to specify max_concurrent_trials.

I have in mind situations where different experiments have different memory requirements: for example, a grid search over different replay buffer sizes, or benchmarking a DQN-style algorithm across different environments (where different observation spaces lead to different memory requirements even with a fixed replay buffer size). Can I tell Ray “this trial requires 10GB RAM, but this other one requires 200GB RAM”?

Each Tune Trainable can define the resources it wants and inform Tune of its requirements. In RLlib this is implemented in default_resource_request() of the Tune Trainable. The resource specification takes the form of the number of CPUs/GPUs a trial needs. We currently don’t reserve resources as a function of the things you mentioned (replay buffer size, observation size). Maybe you can subclass the Algorithm you want and override default_resource_request() as you see fit?