I’m training a fairly large model and am seeing this a lot:
```
(pid=933) 2021-02-12 13:05:41,265 INFO trainable.py:72 -- Checkpoint size is 1446757015 bytes
(pid=raylet) [2021-02-12 13:05:47,441 E 101 111] create_request_queue.cc:119: Not enough memory to create object fbee6bebb46e7ea09b9842a40100000001000000 after 5 tries, will return OutOfMemory to the client
2021-02-12 13:05:54,152 ERROR worker.py:980 -- Possible unhandled error from worker:
ray::ImplicitFunc.save_to_object() (pid=933, ip=172.31.33.132)
  File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 491, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1450, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 140, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object fbee6bebb46e7ea09b9842a40100000001000000 in object store because it is full. Object size is 1446757015 bytes. The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
```
When I run `ray memory`, I see that the object store holds a number of references like this:
```
217ee3264e894e29f07f2eb50100000001000000  LOCAL_REFERENCE  1446757018  (actor call)
| /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:693
| /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:119
| /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:395
```
This leads me to believe that paused trials are the main reason the object store is running low on memory. I assume keeping these in-memory copies is unnecessary, since my trials are being checkpointed and a paused trial can be resumed from its checkpoint. Is this expected/desired behavior, and is it something I should be concerned about? If so, can I configure Ray so that it doesn't save paused trials to the object store?
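For reference, the only stopgap I've found so far is to give the object store more room at startup via `ray.init`'s `object_store_memory` parameter; this just delays the problem rather than stopping paused trials from being stored, and the size below is only an illustrative value for my node:

```python
import ray

# Stopgap, not a fix: enlarge the plasma object store so more
# paused-trial checkpoints fit before ObjectStoreFullError is raised.
# 20 GB is illustrative; adjust to the memory actually available.
ray.init(object_store_memory=20 * 1024**3)
```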