"/root" fills up during Ray Tune runs and the notebook stops with "run out of disk"

Hello everyone,

I am trying to use PB2 from Ray Tune to tune hyperparameters with the following code in a Google Colab environment:

```
from ray import air, tune

# train_ray_tune, pb2 and search_space are defined elsewhere in the notebook.

class CustomStopper(tune.Stopper):
    def __init__(self):
        self.should_stop = False

    def __call__(self, trial_id, result):
        max_iter = 21
        # Once any trial reaches map > 0.6, flag all trials to stop.
        if not self.should_stop and result["map"] > 0.6:
            self.should_stop = True
        return self.should_stop or result["training_iteration"] >= max_iter

    def stop_all(self):
        return self.should_stop

stopper = CustomStopper()
analysis = tune.run(
    train_ray_tune,
    resources_per_trial={"gpu": 1, "cpu": 0},
    scheduler=pb2,
    stop=stopper,
    # PBT starts by training many neural networks in parallel with random hyperparameters.
    config=search_space,
    verbose=2,
    # num_samples=4,
    checkpoint_score_attr="map",
    keep_checkpoints_num=1,
    local_dir="/content/drive/MyDrive/trash/Step2_tuning/carton/PB2/carton_raytune_pb2_24_08_2022/",
    name="carton_raytune_pb2_version3",
)
```

After some iterations, I received a “run out of disk” warning, although I had specified that my local dir is in my Google Drive. I did some checking and found that the “/root” directory is about 115 GB in size.

```
(raylet) [2022-08-30 06:17:40,039 E 707 732] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-08-29_18-18-07_337177_442 is over 95% full, available space: 6109593600; capacity: 179134558208. Object creation will fail if spilling is required.
115G /root/
```
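For readers of the warning above: the two numbers are byte counts. A small sketch of the arithmetic, assuming (this is my reading, not something confirmed in this thread) that they describe the whole filesystem that holds the session directory rather than the size of the directory itself:

```python
# Values copied from the raylet warning above (bytes).
available_bytes = 6109593600
capacity_bytes = 179134558208

GIB = 1024 ** 3

# Share of the filesystem already in use, as a percentage.
used_pct = 100.0 * (1 - available_bytes / capacity_bytes)

print(f"capacity:  {capacity_bytes / GIB:.2f} GiB")
print(f"available: {available_bytes / GIB:.2f} GiB")
print(f"used:      {used_pct:.2f} %")
```

Under that assumption, the warning says the disk as a whole is nearly full, which fits with something outside /tmp/ray taking the space.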

When I kill the Ray Tune process, the space in “/root” is freed. So I guess that “/root” is used for some temporary work?
My question is: how can we limit the resources used by Ray Tune so that my disk does not run out of space again?

Thanks a lot for your help!

@jjyao Some session logs? Do you know how big they are expected to be?

@xwjiang2010 @jjyao Hello, have you found the cause of this problem?
Could you please suggest some solutions that I could test in this case?
Thank you in advance!

@Khoi_LE sorry for the late reply.

I’m trying to understand the issue. Are you saying that this warning message kills the notebook: “/tmp/ray/session_2022-08-29_18-18-07_337177_442 is over 95% full, available space: 6109593600; capacity: 179134558208. Object creation will fail if spilling is required.”? It refers to /tmp/ray, not /root.

@jjyao
I think it is not that simple. The notebook was not killed, but it stopped because the disk ran out of space. I checked the size of “/tmp/ray”: it was about 3.4 GB, as far as I remember. But “/root” was about 115 GB, and it shrank (slowly, down to 0) after I killed the tuning process. So, from what I observed, a lot of disk space is used in “/root” by default during the run. I hope there is a way to manage the disk when it is almost full (for example, by cleaning up files that are no longer used) to solve this problem.

@Khoi_LE,

Could you check which files are using the /root space? I don’t think Ray writes things to /root.
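One way to check this from a notebook cell is sketched below. It is only an illustration: `dir_size_bytes` and `largest_entries` are hypothetical helper names, not part of Ray, and the sizes are sums of file sizes, so they can differ slightly from what `du` reports.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of regular files under path, skipping unreadable ones."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # File vanished or is not readable; ignore it.
                pass
    return total


def largest_entries(path, top=10):
    """Return up to `top` (size_in_bytes, path) pairs for the direct children of path, largest first."""
    sizes = []
    for entry in os.scandir(path):
        try:
            if entry.is_dir(follow_symlinks=False):
                size = dir_size_bytes(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
        except OSError:
            continue
        sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_entries("/root")` and then repeating the call on the largest subdirectory should narrow down where the space goes.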

Hi,

I think so, because when I kill the Ray Tune process, the disk usage goes down. I checked, and the heaviest file in /root is this:


And this is its content:

Best regards!

After some research, I think the cause is that “local_dir” points to Google Drive. Google Drive File Stream requires a lot of local cache to sync files from the local machine to My Drive. So if we set local_dir to a directory on the local virtual machine and copy a backup to Google Drive every few steps, we interact with Drive much less and may avoid this kind of problem.
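A minimal sketch of that workaround, assuming Python 3.8+ (for `dirs_exist_ok`). The helper name `backup_results` and the paths in the usage note are hypothetical, and I have not verified how much this reduces the Drive cache in practice:

```python
import shutil
from pathlib import Path


def backup_results(local_dir, backup_dir):
    """Copy the local results tree into backup_dir, overwriting older copies.

    Returns the path of the copy inside backup_dir.
    """
    local_dir = Path(local_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / local_dir.name
    # dirs_exist_ok=True lets repeated backups reuse the same target folder.
    shutil.copytree(local_dir, target, dirs_exist_ok=True)
    return target
```

For example, with `local_dir="/content/ray_results"` passed to `tune.run`, calling `backup_results("/content/ray_results", "/content/drive/MyDrive/ray_backup")` after the run (or between runs) writes to Drive in one batch instead of on every checkpoint.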
