(raylet)file_system_monitor.cc:105: - "Object creation will fail if spilling is required"

Hi everyone,

I'm trying to tune the hyperparameters with PB2 using this command:
```python
from ray import air, tune

# `train_ray_tune`, `pb2` and `search_space` are defined earlier in the notebook.

class CustomStopper(tune.Stopper):
    """Stop every trial once any trial reaches mAP > 0.6, or after 21 iterations."""

    def __init__(self):
        self.should_stop = False

    def __call__(self, trial_id, result):
        max_iter = 21
        if not self.should_stop and result["map"] > 0.6:
            self.should_stop = True
        return self.should_stop or result["training_iteration"] >= max_iter

    def stop_all(self):
        return self.should_stop

stopper = CustomStopper()
analysis = tune.run(
    train_ray_tune,
    resources_per_trial={"gpu": 1, "cpu": 0},
    # PB2, like PBT, starts by training many networks in parallel with random hyperparameters.
    scheduler=pb2,
    stop=stopper,
    config=search_space,
    verbose=2,
    checkpoint_score_attr="map",  # rank checkpoints by mAP...
    keep_checkpoints_num=1,       # ...and keep only the best one per trial
    local_dir="/content/drive/MyDrive/trash/Step2_tuning/carton/PB2/carton_raytune_pb2/",
    name="raytune_pb2_version3",
)
```
I already tried using local_dir to save all of my checkpoints to Google Drive, but the notebook crashed because it ran out of disk space:
```
(raylet) [2022-08-30 06:17:40,039 E 707 732] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-08-29_18-18-07_337177_442 is over 95% full, available space: 6109593600; capacity: 179134558208. Object creation will fail if spilling is required.
```
I checked the system and found that 115 GB was used in /root:
[screenshot of disk usage]
After I kill the tuning process, the space in /root is freed.

So my question is: how can we prevent Ray Tune from using up all of the disk space in /root?

Thank you!

Hi @Khoi_LE,

can you give us more details about your training function, and specifically show how you create checkpoints and how large the model is?

Can you also show which files in /root/ take up that much space?

Ray's object spilling goes to /tmp, but checkpoints are written to ~/ray_results (i.e., under /root), so that would be a possibility. It would be good to get some clarity on that.
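
If it does turn out to be spilled objects, one thing to try is redirecting Ray's session and spill directories to a disk you control. A minimal sketch (the /content/... paths are placeholders, and `_temp_dir` / `_system_config` are private/experimental arguments):

```python
import json
import ray

ray.init(
    # Session/temp directory; by default, spilled objects end up under it.
    _temp_dir="/content/ray_tmp",
    _system_config={
        # Explicitly direct object spilling to a filesystem path.
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/content/ray_spill"}}
        )
    },
)
```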

Hello @kai,
To answer the first question: "can you give us more details about your training function, and specifically show how you create checkpoints and how large the model is?"
Answer:
Here is my configuration of the search space and PB2:
[screenshot of the search space and PB2 configuration]

Here is my configuration of the YOLOv7 tuner with PB2 and cross-validation, with the objective of finding the best average mAP:

The model is about 145 MB, and in each checkpoint I store the 3 models from the cross-validation splits (n_split = 3), so a single checkpoint is roughly 3 × 145 MB ≈ 435 MB.
To answer the second question: "Can you also show which files in /root/ take up that much space?"
Answer:
Unfortunately, I don't have visibility into what is inside /root; Colab's file browser shows me nothing. I only found that the problem is inside /root/ by checking the size of each directory with the command `du -sh`.
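
For reference, a rough Python equivalent of that `du -sh` check (just a sketch; it sums file sizes under each top-level directory of /root):

```python
import os

def dir_size(path):
    """Recursively sum file sizes under `path`, skipping unreadable entries."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass
    return total

for entry in sorted(os.listdir("/root")):
    full = os.path.join("/root", entry)
    if os.path.isdir(full):
        print(f"{dir_size(full) / 1e9:7.2f} GB  {full}")
```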

To respond to this remark: "Ray's object spilling goes to /tmp, but checkpoints are written to ~/ray_results (i.e., under /root), so that would be a possibility. It would be good to get some clarity on that."

I set my local dir to local_dir="/content/drive/MyDrive/trash/Step2_tuning/carton/PB2/carton_raytune_pb2/". With 2 TB available there, disk space for checkpoints is not a big problem.
So I would rather think that the problem is not in the local dir where we store the checkpoints (on Drive), but on the virtual machine that executes the run and stores temporary information on its own disk. I already tried ray.init with _temp_dir pointing to Google Drive, but the communication latency between Colab and Drive is not good enough with this configuration.

Thank you in advance!


Sorry for the hassle, but could you copy & paste the code e.g. into a gist? The screenshots are unfortunately downsized and very hard to read.

Okay, maybe Colab sets the default temp directory to somewhere inside /root, so it could be spilled objects after all. I'll take a closer look at this.
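
A quick way to check where the session directory actually is (a sketch; as far as I know, `ray.init()` reports it in its address info):

```python
import tempfile
import ray

print(tempfile.gettempdir())  # where Colab points the default temp directory

ctx = ray.init()
# The session directory; spilled objects land under it by default.
print(ctx.address_info["session_dir"])
```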

Hello,
Here is my gist.

I have checked, and the heaviest file in /root is this:

[screenshot of the heaviest file]

And this is its content:

[screenshot of the file's content]

And here is what is inside DriveFS:
[screenshot of the DriveFS directory]

Best regards!

After some research, I think the cause is choosing a local_dir inside Google Drive. Google Drive File Stream requires a lot of cache to sync files from the local machine to My Drive. So if we choose a local_dir on the local virtual machine and make a backup copy to Google Drive every few steps, this may help us interact less with Drive and avoid this kind of problem.
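
A minimal sketch of that idea (untested; the paths and the single end-of-run copy are assumptions, you could also copy periodically from a callback or a background thread):

```python
import shutil
from ray import tune

LOCAL_DIR = "/content/ray_results"                       # on the Colab VM's own disk
DRIVE_DIR = "/content/drive/MyDrive/raytune_pb2_backup"  # hypothetical Drive path

analysis = tune.run(
    train_ray_tune,
    local_dir=LOCAL_DIR,  # keep Tune's output off Google Drive during training
    # ... same arguments as before ...
)

# Copy the results to Drive once at the end (dirs_exist_ok needs Python 3.8+),
# instead of letting Drive File Stream cache every intermediate write.
shutil.copytree(LOCAL_DIR, DRIVE_DIR, dirs_exist_ok=True)
```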