OSError when saving checkpoint with ray.train.lightning.RayTrainReportCallback

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

The following error keeps preventing me from training a model with checkpointing enabled:
OSError: [Errno 22] Error writing bytes to file. Detail: [errno 22] Invalid argument.

Traceback is below:

RayTaskError(OSError): ray::_Inner.train() (pid=10609, ip=10.177.64.150, actor_id=1441c87ad94da2da8fb27eac01000000, repr=TorchTrainer)
  File "/databricks/python/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
           ^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(OSError): ray::_RayTrainWorker__execute.get_next() (pid=10791, ip=10.177.64.150, actor_id=86a99bd71a4687e236d27bf801000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fd0842d0250>)
  File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/root/.ipykernel/1652/command-6358053755949668-3569984399", line 45, in train_func
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
    call._call_and_handle_interrupt(
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 982, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
    self.fit_loop.run()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 217, in run
    self.on_advance_end()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 468, in on_advance_end
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 222, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/root/.ipykernel/1652/command-4885165856204676-1174364235", line 49, in on_train_epoch_end
  File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/session.py", line 658, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/session.py", line 749, in report
    _get_session().report(metrics, checkpoint=checkpoint)
  File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/session.py", line 427, in report
    persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 545, in persist_current_checkpoint
    _pyarrow_fs_copy_files(
  File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 110, in _pyarrow_fs_copy_files
    return pyarrow.fs.copy_files(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/pyarrow/fs.py", line 269, in copy_files
    _copy_files_selector(source_fs, source_sel,
  File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: [Errno 22] Error writing bytes to file. Detail: [errno 22] Invalid argument

Without checkpointing, this model trains fine, and I have even saved checkpoints successfully with a custom callback method; however, that approach does not integrate with the checkpoint_config, which handles best-n checkpoint retention.

I am running this Ray Train job on Databricks on a single node with 4 GPUs. I suspect the error has to do with how Databricks handles temporary directories, and that Ray is failing because it cannot retrieve the checkpoint from tempfile.gettempdir() for some reason.
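
For context, the job is set up roughly like the sketch below; MyLightningModule, train_loader, and the specific config values are placeholders rather than my exact code.

import lightning.pytorch as pl
import ray.train.lightning
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # MyLightningModule / train_loader stand in for my actual module and dataloader
    model = MyLightningModule()
    trainer = pl.Trainer(
        max_epochs=10,
        accelerator="gpu",
        devices="auto",
        strategy=ray.train.lightning.RayDDPStrategy(),
        plugins=[ray.train.lightning.RayLightningEnvironment()],
        # This is the callback that reports metrics + checkpoints back to Ray Train
        callbacks=[ray.train.lightning.RayTrainReportCallback()],
        enable_checkpointing=False,
    )
    trainer = ray.train.lightning.prepare_trainer(trainer)
    trainer.fit(model, train_dataloaders=train_loader)

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2)),
)
result = trainer.fit()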

Have there been related issues with checkpointing on Databricks, or is this not a Databricks problem but something else I’m doing?

I just ran this job with a single GPU and it completed successfully.

Hi, glad to hear it ran on a single GPU! Does it still run into the issue when you rerun it on 4 GPUs?

No, it still breaks when running with multiple GPUs.

The custom method I used simply circumvented the temporary-directory step used in RayTrainReportCallback and saved directly to Unity Catalog. But because I didn’t use ray.train.report, it did not save any metadata for keeping only some checkpoints. That method worked fine with multiple GPUs.
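
Roughly, the custom callback looked like the sketch below; the volume path and class name are placeholders, not my exact code.

import os
import lightning.pytorch as pl

class DirectVolumeCheckpoint(pl.Callback):
    """Sketch: write the Lightning checkpoint straight to a Unity Catalog volume,
    skipping the temp-dir staging that RayTrainReportCallback does. Because
    ray.train.report() is never called, Ray keeps no checkpoint metadata, so
    CheckpointConfig (num_to_keep / best-n) cannot manage these files."""

    def __init__(self, volume_dir="/Volumes/my_catalog/my_schema/my_volume/ckpts"):
        self.volume_dir = volume_dir  # placeholder path

    def on_train_epoch_end(self, trainer, pl_module):
        os.makedirs(self.volume_dir, exist_ok=True)
        ckpt_path = os.path.join(self.volume_dir, f"epoch-{trainer.current_epoch}.ckpt")
        # Safe to call on all ranks: Lightning only writes from rank 0 under DDP
        trainer.save_checkpoint(ckpt_path)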

OK, after looking into it, I do think you’re on the right track. It might be an issue with Databricks temp files and Ray not being able to access those files.

  • Can you confirm that the temp files are writable and that Ray has sufficient permissions to access them?
  • Can you try writing the checkpoints to a different directory that is not temporary? Check out RunConfig and see if you can set the storage_path argument there: ray.train.RunConfig — Ray 2.42.1

Maybe these two can help with the file writing / access issues; there is a rough sketch of both below.
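
Something like the following, purely as a sketch: the run name and storage_path value are placeholders, not a confirmed fix, so substitute a persistent location that the Ray workers can actually write to.

import os
import tempfile

from ray.train import CheckpointConfig, RunConfig

# Quick sanity check for the first point: is the worker temp dir writable?
tmp = tempfile.gettempdir()
print(tmp, "writable:", os.access(tmp, os.W_OK))

# Sketch for the second point: point Ray Train's persistent storage at a
# non-temp location. "/local_disk0/ray_results" is only a placeholder;
# use whatever persistent directory (or URI) fits your workspace.
run_config = RunConfig(
    name="lightning_ckpt_debug",
    storage_path="/local_disk0/ray_results",
    checkpoint_config=CheckpointConfig(num_to_keep=2),
)
# Then pass it to the existing trainer:
# trainer = TorchTrainer(train_func, scaling_config=..., run_config=run_config)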