1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.45.0
- Python version: 3.11.11
- OS: Linux (Ubuntu 22.04.4 LTS)
- Cloud/Infrastructure: Azure Databricks
- Other libs/tools (if relevant):
  - pyarrow==14.0.1
  - pyarrow-hotfix==0.6
  - pytorch-forecasting==1.3.0
  - pytorch-lightning==2.5.1.post0
  - torch==2.3.1+cpu
  - torcheval==0.0.7
  - torchmetrics==1.7.1
  - torchvision==0.18.1+cpu
3. What happened vs. what you expected:
- Expected: I expected to be able to persist results and checkpoints to a location on Databricks that is accessible to all workers, e.g., a DBFS path or a mount pointing to a cloud storage location. This is set via the following code (a condensed sketch of how it is wired into the tuning job is included further below):
from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="val_loss",
        checkpoint_score_order="min",
    ),
    storage_path="/dbfs/Users/imehta@gap.com/ray_results",
    # storage_path="/Volumes/gea_dev/ml_assets/ray/ray_results",
    # storage_path="/dbfs/mnt/ray/ray_results",
)
- Actual: The `persist_current_checkpoint` function consistently fails in each trial when attempting to copy the workers’ checkpoints to the global storage location, so every trial fails without saving metrics or checkpoints. I could not find documentation or examples for running a Ray Tune job for PyTorch Lightning on Databricks specifically, so I am not sure which other locations to try or whether these should work. As a result, I cannot run a Ray Tune job successfully on my PyTorch Lightning model. The same issue occurs with the example code in the docs (Using PyTorch Lightning with Tune — Ray 2.46.0), so it is not an issue with my particular model/training code.
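For reference, here is a condensed, self-contained sketch of how the RunConfig above is wired into the tuning job. The objective name (`tft_ray_objective`) and the TorchTrainer/Tuner layout match the tracebacks below, but the toy LightningModule, random data, and search space are simplified stand-ins for my actual pytorch-forecasting training code:

```python
import lightning.pytorch as pl
import torch
from ray import tune
from ray.train import ScalingConfig
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)
from ray.train.torch import TorchTrainer
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    """Stand-in for my actual TemporalFusionTransformer setup."""

    def __init__(self, lr: float):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # "val_loss" is the metric referenced by CheckpointConfig above.
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def tft_ray_objective(config):
    dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
    train_loader = DataLoader(dataset, batch_size=32)
    val_loader = DataLoader(dataset, batch_size=32)

    trainer = pl.Trainer(
        max_epochs=2,
        accelerator="cpu",
        devices="auto",
        strategy=RayDDPStrategy(),
        plugins=[RayLightningEnvironment()],
        # Reports metrics + checkpoint each epoch; this is where the
        # persist_current_checkpoint failure surfaces.
        callbacks=[RayTrainReportCallback()],
        enable_progress_bar=False,
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(ToyModel(config["lr"]), train_loader, val_loader)


ray_trainer = TorchTrainer(
    tft_ray_objective,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    run_config=run_config,  # the RunConfig shown above
)

tuner = tune.Tuner(
    ray_trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},
    tune_config=tune.TuneConfig(metric="val_loss", mode="min", num_samples=2),
)
results = tuner.fit()
```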
More info:
I have tried:
- writing to a regular DBFS path (`/dbfs/Users/imehta@gap.com/ray_results`), which yields this error:
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/worker.py", line 2822, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/worker.py", line 930, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_Inner.train() (pid=154146, ip=10.115.232.106, actor_id=fdd79b85173b2d6aad2711ff02000000, repr=TorchTrainer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 330, in train
raise skipped from exception_cause(skipped)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
ray.get(object_ref)
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(OSError): ray::_RayTrainWorker__execute.get_next() (pid=155232, ip=10.115.232.116, actor_id=6397b90aef41886ff9b7a48602000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f1febd80090>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
train_func(*args, **kwargs)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/tune/trainable/util.py", line 130, in inner
return trainable(config, **fn_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.ipykernel/1847/command-3739708148611955-3319666677", line 80, in tft_ray_objective
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 561, in fit
call._call_and_handle_interrupt(
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1012, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1056, in _run_stage
self.fit_loop.run()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 217, in run
self.on_advance_end()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 468, in on_advance_end
call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 227, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/lightning/_lightning_utils.py", line 289, in on_train_epoch_end
train.report(metrics=metrics, checkpoint=checkpoint)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 663, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 781, in report
get_session().report(metrics, checkpoint=checkpoint)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 429, in report
persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 550, in persist_current_checkpoint
_pyarrow_fs_copy_files(
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 116, in _pyarrow_fs_copy_files
return pyarrow.fs.copy_files(
^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/pyarrow/fs.py", line 269, in copy_files
_copy_files_selector(source_fs, source_sel,
File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: error closing file
- writing to a DBFS mount (pointing to an Azure Blob Storage container), which yields this error:
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/worker.py", line 2822, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/worker.py", line 930, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_Inner.train() (pid=1929, ip=10.115.232.112, actor_id=f37a6beadb42549d641a56b502000000, repr=TorchTrainer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 330, in train
raise skipped from exception_cause(skipped)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
ray.get(object_ref)
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(OSError): ray::_RayTrainWorker__execute.get_next() (pid=1989, ip=10.115.232.105, actor_id=feef21bd6296f1cd35c0008502000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f17c757f850>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
train_func(*args, **kwargs)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/tune/trainable/util.py", line 130, in inner
return trainable(config, **fn_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.ipykernel/1847/command-3739708148611955-1073586516", line 80, in tft_ray_objective
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 561, in fit
call._call_and_handle_interrupt(
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1012, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1056, in _run_stage
self.fit_loop.run()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 217, in run
self.on_advance_end()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 468, in on_advance_end
call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 227, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/lightning/_lightning_utils.py", line 289, in on_train_epoch_end
train.report(metrics=metrics, checkpoint=checkpoint)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 663, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 781, in report
get_session().report(metrics, checkpoint=checkpoint)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 429, in report
persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 550, in persist_current_checkpoint
_pyarrow_fs_copy_files(
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 116, in _pyarrow_fs_copy_files
return pyarrow.fs.copy_files(
^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/pyarrow/fs.py", line 269, in copy_files
_copy_files_selector(source_fs, source_sel,
File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: [Errno 22] Error writing bytes to file. Detail: [errno 22] Invalid argument
- writing to a Unity Catalog (UC) volume, which yields the same error as the DBFS mount.
I can write to each of these locations without issue outside of the tuning job.
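In case it helps narrow things down: the failing call reduces to `pyarrow.fs.copy_files` between local filesystems (both `/dbfs` and `/Volumes` are FUSE mounts, so pyarrow treats them as plain local paths). A minimal standalone version of that copy, with an illustrative destination subdirectory of my own choosing, would look roughly like this:

```python
# Isolates the copy that persist_current_checkpoint performs (see the
# tracebacks above: ray.train._internal.storage._pyarrow_fs_copy_files ->
# pyarrow.fs.copy_files). Paths are illustrative; "_copy_test" is a
# throwaway subdirectory, not something from my actual run.
import os
import tempfile

import pyarrow.fs

# Fake "checkpoint" directory on local disk, standing in for a worker's
# local checkpoint directory.
source_dir = tempfile.mkdtemp()
with open(os.path.join(source_dir, "checkpoint.ckpt"), "wb") as f:
    f.write(os.urandom(1024 * 1024))  # ~1 MB of dummy data

# Same kind of destination as storage_path above (a FUSE-mounted path).
destination_dir = "/dbfs/Users/imehta@gap.com/ray_results/_copy_test"

local_fs = pyarrow.fs.LocalFileSystem()
local_fs.create_dir(destination_dir, recursive=True)

pyarrow.fs.copy_files(
    source_dir,
    destination_dir,
    source_filesystem=local_fs,
    destination_filesystem=local_fs,
)
```

This snippet just isolates the specific pyarrow call from the tracebacks, in case that is useful for reproducing the error outside of a full tuning run.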
Previous reports of the same or a similar issue that remain unresolved:
- OSError when saving checkpoint with ray.train.lightning.RayTrainReportCallback
- Pyarrow error with ray & lightning on databricks
Thank you in advance for your help! It would be much appreciated if we could get this up and running, as we are evaluating Ray Tune for hyperparameter tuning org-wide.