Hyperparameter Tuning with specified session directory (!=/tmp/ray/)

Hi,
I am using Ray to tune the hyperparameters of a PyTorch model.
The storage capacity on /tmp is limited, thus I want to specify the session directory (i.e. /tmp/ray → /my_directory) when working with tune.Tuner().
I found that I could specify a parameter --temp-dir or pass it somehow to ray.init().

However, I neither call the ray command line nor ray.init(), so I do not know where to set the session directory. Any simple tips?
So far I am basically passing a tune trainable to tune.Tuner() and calling tuner.fit().

Thanks for any advice!
stillsen

You most likely want to set the local_dir argument in RunConfig (see the Ray AIR API reference, Ray 2.2.0). This is where all the results and checkpoints will be saved.
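
For reference, a minimal sketch of that suggestion, assuming the Ray 2.2 API (air.RunConfig with a local_dir argument; other releases may name this differently). The helper names resolve_results_dir and build_tuner are made up for illustration, and the trainable and search space are whatever you already pass to tune.Tuner(). Note that this only controls where results and checkpoints are written.

```python
import os


def resolve_results_dir(path):
    """Expand '~', make the path absolute, and create the directory if needed."""
    resolved = os.path.abspath(os.path.expanduser(path))
    os.makedirs(resolved, exist_ok=True)
    return resolved


def build_tuner(trainable, param_space, results_dir):
    """Build a Tuner that writes results and checkpoints under results_dir.

    Hypothetical helper; assumes Ray 2.2, where RunConfig accepts local_dir.
    """
    # Imported here so the path helper above also works without Ray installed.
    from ray import air, tune

    return tune.Tuner(
        trainable,
        param_space=param_space,
        run_config=air.RunConfig(local_dir=resolve_results_dir(results_dir)),
    )
```

Usage would look like build_tuner(my_trainable, my_param_space, "/my_directory").fit(), where my_trainable and my_param_space are placeholders for your own objects.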

Thanks for your reply.

I forgot to mention that I indeed have set local_dir to /my_directory. I did this because of “No space left on device” earlier. Both locations (~/ray_results/ and /tmp/ray) have limited storage capacity.

I thought that active trials are temporarily stored in /tmp/ray and that this might cause the error.

error log:
Traceback (most recent call last):
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/tuner.py", line 272, in fit
    return self._local_tuner.fit()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 420, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 532, in _fit_internal
    analysis = run(
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/tune.py", line 388, in run
    _ray_auto_init()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/tune.py", line 892, in _ray_auto_init
    ray.init()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/worker.py", line 1567, in init
    hook()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/registry.py", line 241, in flush
    self.references[k] = ray.put(v)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/worker.py", line 2375, in put
    object_ref = worker.put_object(value, owner_address=serialize_owner_address)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/worker.py", line 619, in put_object
    self.core_worker.put_serialized_object_and_increment_local_ref(
  File "python/ray/_raylet.pyx", line 1708, in ray._raylet.CoreWorker.put_serialized_object_and_increment_local_ref
  File "python/ray/_raylet.pyx", line 1597, in ray._raylet.CoreWorker._create_put_buffer
  File "python/ray/_raylet.pyx", line 193, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default). Tip: Use df on this node to check disk usage and ray memory to check object store memory usage.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "01_train_resnet18_SCV_P65.py", line 241, in <module>
    analysis = tuner.fit()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/tuner.py", line 274, in fit
    raise TuneError(
ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use tuner = Tuner.restore("/beegfs/stiller/PatchCROP_all/Output/P_65_resnet18_SCV_no_test_L1_ALB_TR10/Tuning_resnet18_SCV_no_test_all_bayes_L1_ALB_f1_65_TR10").
(raylet) [2023-01-06 12:51:59,271 E 38384 38401] (raylet) dlmalloc.cc:202: Out of disk space with fallocate error: No space left on device
(raylet) [2023-01-06 12:51:59,271 E 38384 38401] (raylet) dlmalloc.cc:202: Out of disk space with fallocate error: No space left on device
(raylet) [2023-01-06 12:51:59,271 E 38384 38401] (raylet) object_lifecycle_manager.cc:214: Plasma fallback allocator failed, likely out of disk space.
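
The tip in the error message suggests checking disk usage with df. A rough Python equivalent, as a sketch only: disk_usage_report is a hypothetical helper, the listed paths are just the locations mentioned in this thread, and the used fraction computed here may differ slightly from what df or Ray report.

```python
import os
import shutil


def disk_usage_report(paths):
    """Return {path: (used_fraction, free_gib)} for every path that exists."""
    report = {}
    for path in paths:
        expanded = os.path.expanduser(path)
        if not os.path.exists(expanded):
            continue  # skip locations that do not exist on this node
        usage = shutil.disk_usage(expanded)
        # Guard against filesystems that report a total size of zero.
        used_fraction = usage.used / usage.total if usage.total else 0.0
        report[path] = (used_fraction, usage.free / 1024**3)
    return report


if __name__ == "__main__":
    # Locations discussed in this thread; adjust to your own setup.
    candidates = ["/tmp", "~/ray_results", "/my_directory"]
    for path, (used, free_gib) in disk_usage_report(candidates).items():
        print(f"{path}: {used:.0%} used, {free_gib:.1f} GiB free")
```

A location whose filesystem is near the 95% threshold named in the error would be the first suspect.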

You can call ray.init() with your arguments (e.g. _temp_dir) before calling Tuner.fit(). The Tuner will then reuse the already-initialized Ray session instead of starting its own.

Thank you for your support! ray.init(_temp_dir='/my_directory') does the trick.
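
For anyone landing here later, a minimal sketch of that setup. The only Ray calls used are ray.init(_temp_dir=...), ray.is_initialized(), and tune.Tuner(...).fit(). The helper names and the trainable/param_space arguments are placeholders, not part of Ray. The leading underscore on _temp_dir suggests Ray treats it as a private argument, so it may change between releases.

```python
import os


def ray_init_kwargs(temp_dir):
    """Build keyword arguments for ray.init() that move the session directory."""
    temp_dir = os.path.abspath(temp_dir)
    os.makedirs(temp_dir, exist_ok=True)  # make sure the target exists
    return {"_temp_dir": temp_dir}


def tune_with_session_dir(trainable, param_space, temp_dir):
    """Start Ray under temp_dir (instead of /tmp/ray), then run the Tuner.

    Hypothetical helper; requires Ray to be installed.
    """
    import ray
    from ray import tune

    # Initialize Ray explicitly so Tune does not auto-initialize with defaults.
    if not ray.is_initialized():
        ray.init(**ray_init_kwargs(temp_dir))

    tuner = tune.Tuner(trainable, param_space=param_space)
    return tuner.fit()
```

The important part is the order: ray.init() has to run before tuner.fit(), because otherwise Tune calls ray.init() itself with default settings, as the traceback above shows.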