Thanks for your reply.
I forgot to mention that I have indeed set local_dir to /my_directory. I did this because of an earlier “No space left on device” error. Both locations (~/ray_results/ and /tmp/ray) have limited storage capacity.
I thought that active trials are temporarily stored in /tmp/ray and that this might cause the error.
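For reference, here is a minimal sketch of how the free space on these locations could be checked from Python. It only uses the standard library. The 95% threshold is taken from the error message below; the helper names (`disk_utilization`, `report`) are my own and not part of Ray, and the numbers may differ slightly from what `df` or Ray itself reports.

```python
import os
import shutil

# Threshold taken from the OutOfDiskError message (95% by default).
THRESHOLD = 0.95


def disk_utilization(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def report(paths, threshold=THRESHOLD):
    """Print utilization for each existing path; return paths at or above threshold."""
    over = []
    for path in paths:
        path = os.path.expanduser(path)
        if not os.path.exists(path):
            print(f"{path}: does not exist")
            continue
        used = disk_utilization(path)
        print(f"{path}: {used:.1%} used")
        if used >= threshold:
            over.append(path)
    return over


if __name__ == "__main__":
    # The three locations mentioned above.
    report(["~/ray_results", "/tmp/ray", "/my_directory"])
```

Any path returned by `report` sits on a filesystem that is at or above the threshold, which would match the condition described in the error.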
Error log:
Traceback (most recent call last):
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/tuner.py", line 272, in fit
    return self._local_tuner.fit()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 420, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 532, in _fit_internal
    analysis = run(
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/tune.py", line 388, in run
    _ray_auto_init()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/tune.py", line 892, in _ray_auto_init
    ray.init()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/worker.py", line 1567, in init
    hook()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/registry.py", line 241, in flush
    self.references[k] = ray.put(v)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/worker.py", line 2375, in put
    object_ref = worker.put_object(value, owner_address=serialize_owner_address)
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/_private/worker.py", line 619, in put_object
    self.core_worker.put_serialized_object_and_increment_local_ref(
  File "python/ray/_raylet.pyx", line 1708, in ray._raylet.CoreWorker.put_serialized_object_and_increment_local_ref
  File "python/ray/_raylet.pyx", line 1597, in ray._raylet.CoreWorker._create_put_buffer
  File "python/ray/_raylet.pyx", line 193, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default). Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "01_train_resnet18_SCV_P65.py", line 241, in <module>
    analysis = tuner.fit()
  File "/home/stiller/anaconda3/envs/CUR2/lib/python3.8/site-packages/ray/tune/tuner.py", line 274, in fit
    raise TuneError(
ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/beegfs/stiller/PatchCROP_all/Output/P_65_resnet18_SCV_no_test_L1_ALB_TR10/Tuning_resnet18_SCV_no_test_all_bayes_L1_ALB_f1_65_TR10")`.
(raylet) [2023-01-06 12:51:59,271 E 38384 38401] (raylet) dlmalloc.cc:202: Out of disk space with fallocate error: No space left on device
(raylet) [2023-01-06 12:51:59,271 E 38384 38401] (raylet) dlmalloc.cc:202: Out of disk space with fallocate error: No space left on device
(raylet) [2023-01-06 12:51:59,271 E 38384 38401] (raylet) object_lifecycle_manager.cc:214: Plasma fallback allocator failed, likely out of disk space.