Specifying memory requirements for RLlib algorithms in Ray Tune

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am running some experiments with DQN, which needs a lot of memory for its replay buffer, and I’d like to schedule multiple experiments on the same Ray cluster. By default, Ray only looks at the number of CPU cores requested when scheduling trials, which leads to it starting more trials than I have memory for. Is there a way to tell Ray how much memory my experiment will use?

I tried setting resources_per_trial={"cpu": 1, "memory": 12000000} in tune.run(), but I get the error:
Resources for ... have been automatically set to ... by its 'default_resource_request()' method. Please clear the 'resources_per_trial' option.
I take this to mean that you’re not supposed to override the resource request from the RLlib algorithm, but then how do you manage memory for DQN algorithms?

Hi @mgerstgrasser, the “cpu” field is merely a bookkeeping variable in Ray; it doesn’t necessarily map to physical CPUs. By setting it you define what fraction of the cluster’s resources each trial should be allocated. That said, what you can do here is limit the number of concurrent Tune trials to a count that you know won’t run out of memory. See https://docs.ray.io/en/latest/ray-air/tuner.html#how-to-specify-parallelism for how to specify max_concurrent_trials.

I have in mind situations where different experiments have different memory requirements: for example, a grid search over different replay buffer sizes, or benchmarking a DQN-style algorithm across different environments (where different observation spaces lead to different memory requirements even with a fixed replay buffer size). Can I tell Ray “this trial requires 10GB RAM, but this other one requires 200GB RAM”?

Each Tune Trainable can define the resources it wants and inform Tune of its requirements. In RLlib this is implemented in default_resource_request() of the Tune Trainable. The resource specification takes the form of the number of CPUs/GPUs a trial needs. We currently don’t reserve resources as a function of the things you mentioned (replay buffer size, observation size). Maybe you can subclass the Algorithm you want and override default_resource_request() as you see fit?