Can Tune be configured to not keep paused trials in object store?

I’m training a fairly large model and am seeing this a lot:

(pid=933) 2021-02-12 13:05:41,265	INFO -- Checkpoint size is 1446757015 bytes
(pid=raylet) [2021-02-12 13:05:47,441 E 101 111] Not enough memory to create object fbee6bebb46e7ea09b9842a40100000001000000 after 5 tries, will return OutOfMemory to the client
2021-02-12 13:05:54,152	ERROR -- Possible unhandled error from worker: ray::ImplicitFunc.save_to_object() (pid=933, ip=
  File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 491, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1450, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 140, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object fbee6bebb46e7ea09b9842a40100000001000000 in object store because it is full. Object size is 1446757015 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.

When running `ray memory`, I see that the object store holds a number of references like this:

217ee3264e894e29f07f2eb50100000001000000  LOCAL_REFERENCE       1446757018   (actor call)  | /opt/conda/lib/python3.8/site-packages/ray/tune/ | /opt/conda/lib/python3.8/site-packages/ray/tune/ | /opt/conda/lib/python3.8/site-packages/ray/tune/

This leads me to believe that paused trials are the main cause of running low on object store memory. I assume keeping them there is unnecessary, because my trials are being checkpointed and paused trials can be resumed from checkpoint. Is this expected/desired behavior, and is it something I should be concerned about? If so, can I configure Ray so that it doesn’t save paused trials to the object store?

To follow up: it appears I should be concerned, because although the job recovered several times, it eventually failed with this error.

Hmm, how many concurrent trials are you running?

I haven’t explicitly capped concurrent trials. I’ve got a maximum of 36 trials, and due to resource constraints only 13 can actually run at once. Do you think setting `max_concurrent` might help, per the User Guide & Configuring Tune docs (Ray v2.0.0.dev0)?

Hmm, what type of search algorithm and scheduler are you using?

HyperBandForBOHB as the scheduler and TuneBOHB as the search algorithm.
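For reference, the cap I had in mind would look roughly like this. This is just a sketch: `train_fn`, the `config` space, `max_t=100`, and the `loss` metric are placeholders standing in for my actual setup, and I’m assuming `ConcurrencyLimiter` is the right way to enforce `max_concurrent` with TuneBOHB.

```python
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.bohb import TuneBOHB

# Placeholder trainable; my real training function checkpoints a large model.
def train_fn(config):
    for step in range(100):
        tune.report(loss=1.0 / (step + 1))

# Wrap the BOHB searcher so at most 13 trials are generated at once
# (matching what my resources can actually run concurrently).
algo = ConcurrencyLimiter(TuneBOHB(), max_concurrent=13)
scheduler = HyperBandForBOHB(time_attr="training_iteration", max_t=100)

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # placeholder search space
    metric="loss",
    mode="min",
    num_samples=36,
    search_alg=algo,
    scheduler=scheduler,
)
```

I’m not sure whether limiting concurrency alone would stop paused trials from accumulating in the object store, though.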

Hmm, I see… yeah, I think object spilling is the right way to handle this.
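If it helps, enabling spilling to local disk would look something like the sketch below. This assumes a Ray version (~1.2) where object spilling is still an experimental feature configured through `_system_config`; the exact flag names may differ in other releases, and `/tmp/ray_spill` is just an example directory.

```python
import json

import ray

# Sketch: ask Ray to spill objects that don't fit in the object store
# to local disk instead of failing with ObjectStoreFullError.
# Experimental API; verify the flag names against your Ray version.
ray.init(
    _system_config={
        "automatic_object_spilling_enabled": True,
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
        ),
    }
)
```

With this in place, large checkpoint objects held by paused trials should spill to disk rather than exhausting object store memory.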