Can Tune be configured to not keep paused trials in object store?

matthew.cox · February 12, 2021, 1:17pm

I’m training a fairly large model and am seeing this a lot:

(pid=933) 2021-02-12 13:05:41,265	INFO trainable.py:72 -- Checkpoint size is 1446757015 bytes
(pid=raylet) [2021-02-12 13:05:47,441 E 101 111] create_request_queue.cc:119: Not enough memory to create object fbee6bebb46e7ea09b9842a40100000001000000 after 5 tries, will return OutOfMemory to the client
2021-02-12 13:05:54,152	ERROR worker.py:980 -- Possible unhandled error from worker: ray::ImplicitFunc.save_to_object() (pid=933, ip=172.31.33.132)
  File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 491, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1450, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 140, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object fbee6bebb46e7ea09b9842a40100000001000000 in object store because it is full. Object size is 1446757015 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.

When running ray memory I see that the object store has a number of references like this:

217ee3264e894e29f07f2eb50100000001000000  LOCAL_REFERENCE       1446757018   (actor call)  | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:693 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:119 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:395

This leads me to believe that paused trails are the main cause of being low on object store memory. I assume this is unnecessary because my trials are being checkpointed and paused trials can be resumed from checkpoint. Is this expected/desired behavior and/or something I should be concerned about? If this is something that I should be concerned about, can I configure ray so that it doesn’t save paused trials to the object store?

matthew.cox · February 12, 2021, 2:35pm

To follow up - it appears that I should be concerned about this because while it recovered several times, the tuning job eventually failed due to this error.

rliaw · February 12, 2021, 6:02pm

Hmm, how many concurrent trials are you running?

matthew.cox · February 12, 2021, 6:05pm

I haven’t explicitly capped concurrent trials - I’ve got a max of 36 trials and due to resource constraints, only 13 can actually be running at once. Do you think that setting max_concurrent may help per User Guide & Configuring Tune — Ray v2.0.0.dev0 ?

rliaw · February 12, 2021, 6:40pm

Hmm, what type of search algorithm and scheduler are you using?

matthew.cox · February 12, 2021, 6:42pm

HyperBandForBOHB and TuneBOHB

rliaw · February 15, 2021, 8:08am

Hmm, I see… yeah I guess object spilling should be the right way to do this.

Topic		Replies	Views
Object Spilling useful to avoid running out of memory when using Ray Tune Ray Core	13	922	March 4, 2021
PPO Trial Runner:Process Events Pinned In Object Store RLlib	0	200	May 4, 2021
Ray using so much memory I cannot even start the tuning Ray Tune	5	2346	April 24, 2023
Ray Tune is Frozen, Large Number of Trials are Paused Ray Tune	1	453	October 26, 2021
Most runs immediately failing with "out of memory" Ray Tune	5	1237	May 11, 2021

Can Tune be configured to not keep paused trials in object store?

Related topics