How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
The following error continues to prevent me from training a model with checkpointing:
OSError: [Errno 22] Error writing bytes to file. Detail: [errno 22] Invalid argument
Traceback is below:
RayTaskError(OSError): ray::_Inner.train() (pid=10609, ip=10.177.64.150, actor_id=1441c87ad94da2da8fb27eac01000000, repr=TorchTrainer)
File "/databricks/python/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
ray.get(object_ref)
^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(OSError): ray::_RayTrainWorker__execute.get_next() (pid=10791, ip=10.177.64.150, actor_id=86a99bd71a4687e236d27bf801000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fd0842d0250>)
File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
train_func(*args, **kwargs)
File "/root/.ipykernel/1652/command-6358053755949668-3569984399", line 45, in train_func
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
call._call_and_handle_interrupt(
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 982, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1026, in _run_stage
self.fit_loop.run()
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 217, in run
self.on_advance_end()
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 468, in on_advance_end
call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 222, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/root/.ipykernel/1652/command-4885165856204676-1174364235", line 49, in on_train_epoch_end
File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/session.py", line 658, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/session.py", line 749, in report
_get_session().report(metrics, checkpoint=checkpoint)
File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/session.py", line 427, in report
persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 545, in persist_current_checkpoint
_pyarrow_fs_copy_files(
File "/databricks/python/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 110, in _pyarrow_fs_copy_files
return pyarrow.fs.copy_files(
^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/pyarrow/fs.py", line 269, in copy_files
_copy_files_selector(source_fs, source_sel,
File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: [Errno 22] Error writing bytes to file. Detail: [errno 22] Invalid argument
Without checkpointing, this model trains without issue, and I have even saved checkpoints successfully with a custom callback method. However, that method does not integrate with checkpoint_config, which provides best-N checkpoint retention.
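For context, the relevant parts of my setup look roughly like this. This is a simplified sketch: MyLightningModule, MyDataModule, the reported metric, and num_to_keep=3 are placeholders for what is actually in my notebook, and Ray's Lightning integration pieces are omitted.

```python
import os
import tempfile

import lightning.pytorch as pl
import ray.train
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


class ReportCheckpointCallback(pl.Callback):
    """Report metrics plus a checkpoint to Ray Train each epoch.

    This is the call chain that appears in the traceback above:
    on_train_epoch_end -> ray.train.report -> persist_current_checkpoint.
    """

    def on_train_epoch_end(self, trainer, pl_module):
        with tempfile.TemporaryDirectory() as tmpdir:
            trainer.save_checkpoint(os.path.join(tmpdir, "checkpoint.ckpt"))
            ray.train.report(
                metrics={"epoch": trainer.current_epoch},  # placeholder metric
                checkpoint=ray.train.Checkpoint.from_directory(tmpdir),
            )


def train_func(config):
    # MyLightningModule / MyDataModule are placeholders for my actual code;
    # Ray's Lightning helpers (RayDDPStrategy, prepare_trainer, ...) are
    # omitted here for brevity.
    trainer = pl.Trainer(
        max_epochs=config["max_epochs"],
        callbacks=[ReportCheckpointCallback()],
    )
    trainer.fit(MyLightningModule(), datamodule=MyDataModule())


ray_trainer = TorchTrainer(
    train_func,
    train_loop_config={"max_epochs": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # single node, 4 GPUs
    run_config=RunConfig(
        # The best-N retention my custom-callback workaround cannot use.
        checkpoint_config=CheckpointConfig(
            num_to_keep=3,
            checkpoint_score_attribute="epoch",
        ),
    ),
)
result = ray_trainer.fit()
```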
I am running this Ray Train job on Databricks on a single node with 4 GPUs. I suspect the error has to do with how Databricks handles temporary directories, and that Ray is failing because it cannot retrieve the checkpoint from tempfile.gettempdir() for some reason.
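To test that suspicion outside of Ray Train, something like the following sketch should reproduce the same kind of pyarrow copy that storage.py performs. The destination path and filesystems here are only examples of what I assume the persistent storage path resolves to on the cluster, not values taken from Ray itself.

```python
import os
import tempfile

import pyarrow.fs

# Example destination only; substitute the actual storage_path / persistent
# checkpoint directory used on the cluster (e.g. a /dbfs or /local_disk0 path).
DEST = "/local_disk0/ray_checkpoint_copy_test"

# Mimic a checkpoint directory: one ~1 MiB file in a temp dir, the same kind
# of source that persist_current_checkpoint() copies from.
src_dir = tempfile.mkdtemp()
with open(os.path.join(src_dir, "dummy.ckpt"), "wb") as f:
    f.write(os.urandom(1024 * 1024))

os.makedirs(DEST, exist_ok=True)

# Same pyarrow API that ray/train/_internal/storage.py calls; if this raises
# OSError: [Errno 22] on its own, the failure is in the filesystem copy
# rather than anything Ray Train-specific.
pyarrow.fs.copy_files(
    src_dir,
    DEST,
    source_filesystem=pyarrow.fs.LocalFileSystem(),
    destination_filesystem=pyarrow.fs.LocalFileSystem(),
)
print("copy succeeded:", os.listdir(DEST))
```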
Have there been related issues with checkpointing on Databricks, or is this not a Databricks problem but something else I'm doing?