Persisting checkpoints in Databricks

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.45.0
  • Python version: 3.11.11
  • OS: Linux (Ubuntu 22.04.4 LTS)
  • Cloud/Infrastructure: Azure Databricks
  • Other libs/tools (if relevant):
    • pyarrow==14.0.1
    • pyarrow-hotfix==0.6
    • pytorch-forecasting==1.3.0
    • pytorch-lightning==2.5.1.post0
    • torch==2.3.1+cpu
    • torcheval==0.0.7
    • torchmetrics==1.7.1
    • torchvision==0.18.1+cpu

3. What happened vs. what you expected:

  • Expected: To be able to persist results and checkpoints to a location on Databricks that is accessible to all workers, e.g., a DBFS path or a mount pointing to a cloud storage location. The storage location is set via the following code:
from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="val_loss",
        checkpoint_score_order="min",
    ),
    storage_path="/dbfs/Users/imehta@gap.com/ray_results",
    # storage_path="/Volumes/gea_dev/ml_assets/ray/ray_results",
    # storage_path="/dbfs/mnt/ray/ray_results",
)
  • Actual: persist_current_checkpoint consistently fails in every trial when copying the workers’ checkpoints to the shared location, so each trial fails without saving any metrics or checkpoints. I could not find documentation or examples for running a Ray Tune job for a PyTorch Lightning model on Databricks specifically, so I am not sure which other locations to try or whether these are expected to work. As a result, I cannot run a Ray Tune job successfully on my PyTorch Lightning model. The same issue occurs with the example code in the docs (Using PyTorch Lightning with Tune — Ray 2.46.0), so it is not an issue with my particular model/training code; a minimal sketch of the setup is included below.
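
For reference, here is a minimal, self-contained sketch of the kind of job being run, modeled on the docs example linked above. The ToyModule, random dataloaders, learning-rate search space, and sample counts are placeholders, not the actual training code:

# Minimal sketch of a PyTorch Lightning + Ray Tune job on Ray Train.
# Placeholders only; the real model/training code is not shown here.
import torch
import lightning.pytorch as pl

from ray import tune
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)


class ToyModule(pl.LightningModule):
    """Placeholder model that logs 'val_loss' so checkpoint scoring works."""

    def __init__(self, lr: float):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y), sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def make_loader():
    # Random data standing in for the real dataset.
    data = torch.utils.data.TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
    return torch.utils.data.DataLoader(data, batch_size=32)


def train_func(config):
    model = ToyModule(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=3,
        accelerator="cpu",
        devices="auto",
        strategy=RayDDPStrategy(),
        plugins=[RayLightningEnvironment()],
        callbacks=[RayTrainReportCallback()],  # reports metrics + checkpoint via train.report
        enable_checkpointing=False,
        enable_progress_bar=False,
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(model, train_dataloaders=make_loader(), val_dataloaders=make_loader())


ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    run_config=RunConfig(
        storage_path="/dbfs/Users/imehta@gap.com/ray_results",  # path from the report above
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="val_loss",
            checkpoint_score_order="min",
        ),
    ),
)

tuner = tune.Tuner(
    ray_trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},
    tune_config=tune.TuneConfig(metric="val_loss", mode="min", num_samples=2),
)
results = tuner.fit()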

More info:
I have tried:

  • writing to a regular DBFS path ("/dbfs/Users/imehta@gap.com/ray_results"), which yields this error:
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
             ^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_Inner.train() (pid=154146, ip=10.115.232.106, actor_id=fdd79b85173b2d6aad2711ff02000000, repr=TorchTrainer)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 330, in train
    raise skipped from exception_cause(skipped)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(OSError): ray::_RayTrainWorker__execute.get_next() (pid=155232, ip=10.115.232.116, actor_id=6397b90aef41886ff9b7a48602000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f1febd80090>)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/tune/trainable/util.py", line 130, in inner
    return trainable(config, **fn_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.ipykernel/1847/command-3739708148611955-3319666677", line 80, in tft_ray_objective
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 561, in fit
    call._call_and_handle_interrupt(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1012, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1056, in _run_stage
    self.fit_loop.run()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 217, in run
    self.on_advance_end()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 468, in on_advance_end
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 227, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/lightning/_lightning_utils.py", line 289, in on_train_epoch_end
    train.report(metrics=metrics, checkpoint=checkpoint)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 663, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 781, in report
    get_session().report(metrics, checkpoint=checkpoint)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/session.py", line 429, in report
    persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 550, in persist_current_checkpoint
    _pyarrow_fs_copy_files(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a6e87890-e92a-487b-b15f-fe5c61571f8e/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 116, in _pyarrow_fs_copy_files
    return pyarrow.fs.copy_files(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/pyarrow/fs.py", line 269, in copy_files
    _copy_files_selector(source_fs, source_sel,
  File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: error closing file
  • writing to a DBFS mount (pointing to an Azure Blob Storage container), which yields this error:
(identical to the traceback for the DBFS path above: the same call chain through train.report, persist_current_checkpoint, and pyarrow.fs.copy_files, with only PIDs, IPs, and actor IDs differing, ending instead in:)
OSError: [Errno 22] Error writing bytes to file. Detail: [errno 22] Invalid argument
  • writing to a Unity Catalog (UC) volume, which yields the same error as the mount location

I can write to each of these locations without issue outside of the tuning job.
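
To help isolate this, the failing copy can also be exercised outside of Ray entirely. Below is a minimal sketch, assuming the same pyarrow.fs.copy_files call that persist_current_checkpoint makes when storage_path is a plain POSIX path; the destination is an example path:

# Reproduce the checkpoint-persist copy step outside of any Tune/Train job.
import os
import tempfile

import pyarrow.fs

# Build a small local "checkpoint" directory to copy.
src_dir = tempfile.mkdtemp(prefix="fake_ckpt_")
with open(os.path.join(src_dir, "model.pt"), "wb") as f:
    f.write(os.urandom(1024 * 1024))  # 1 MiB of dummy bytes

# Example destination; swap in the /dbfs, /dbfs/mnt, or /Volumes path under test.
dest_dir = "/dbfs/Users/imehta@gap.com/ray_results/_copy_test"

local_fs = pyarrow.fs.LocalFileSystem()
local_fs.create_dir(dest_dir, recursive=True)

# Mirrors the pyarrow.fs.copy_files call seen in the traceback for a local
# (POSIX) storage_path; if the FUSE mount rejects the write, it should fail here too.
pyarrow.fs.copy_files(
    src_dir,
    dest_dir,
    source_filesystem=local_fs,
    destination_filesystem=local_fs,
)
print("copy to", dest_dir, "succeeded")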

Previous reports of similar/same issue that are unresolved:

Thank you in advance for your help! It would be much appreciated if we could get this up and running, as we are evaluating Ray Tune for hyperparameter tuning org-wide.

This might be an issue on the Databricks (DBX) side. Have you tried running the same code in an Anyscale Workspace or a local Ray cluster on your laptop?

If you want to try this on Anyscale, it’s straightforward to use the shared storage that comes out of the box with an Anyscale workspace.

Hi @Michael_Haines, thanks for the response!

I also think it could be related to the interaction between Ray and Databricks. I haven’t tried it in another environment, but we are currently trying to get this working on Databricks as an initial POC, since our ML workloads already run on that platform. Do you know whether Ray is intended to work on Databricks, or should we assume that running on Databricks clusters is unsupported and that Ray-on-Spark is only meant for non-Databricks Spark clusters?

I can’t speak to whether this is intended to work on Databricks or not. My suggestion to run the code elsewhere is about isolating whether the problem is in your application code or in the runtime environment (e.g., Databricks).
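
For reference, a minimal version of that isolation check is to keep the Tuner/TorchTrainer setup unchanged and only swap the cluster and storage location, e.g. (paths are examples):

import ray
from ray.train import CheckpointConfig, RunConfig

ray.init()  # plain local Ray cluster (e.g., on a laptop) instead of Ray-on-Spark

run_config = RunConfig(
    storage_path="/tmp/ray_results",  # local path instead of /dbfs or /Volumes
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="val_loss",
        checkpoint_score_order="min",
    ),
)
# Build the TorchTrainer/Tuner exactly as before with this run_config and call tuner.fit().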