I am running a Ray Tune job distributed over a Ray cluster on Kubernetes.
Some of my trials are hitting the following S3 access error when saving checkpoints. I'm confused because this trial was able to save checkpoints earlier in the run. My AWS credentials are set via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. I'd appreciate any assistance.
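For context, here is a simplified sketch of how the run is set up (the training function, scaling, and search space below are placeholders; the run name and bucket match the failing key in the traceback):

```python
import os

from ray import tune
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import ASHAScheduler

# The credentials are expected to be visible in every Ray pod (head and
# workers), since pyarrow's S3 filesystem reads them from the environment
# when persisting checkpoints.
assert os.environ.get("AWS_ACCESS_KEY_ID") and os.environ.get("AWS_SECRET_ACCESS_KEY")

def train_func(config):
    ...  # Lightning training loop; checkpoints are reported via ray.train.report

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    run_config=RunConfig(
        name="helm-asha-2",               # run name seen in the failing S3 key
        storage_path="s3://aframe-test",  # bucket from the error message
    ),
)
tuner = tune.Tuner(
    trainer,
    # Placeholder search space; the real one sweeps the parameters visible in
    # the trial directory names (data_mute_prob, data_swap_prob, ...).
    param_space={"train_loop_config": {"data_mute_prob": tune.uniform(0.0, 0.2)}},
    tune_config=tune.TuneConfig(scheduler=ASHAScheduler(), num_samples=200),
)
tuner.fit()
```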
Failure # 1 (occurred at 2024-02-13_12-35-45)
ray::_Inner.train() (pid=3576, ip=10.244.89.183, actor_id=1c88c5f44707f86ebf10853401000000, repr=TorchTrainer)
File "/usr/local/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 342, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 43, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(OSError): ray::_RayTrainWorker__execute.get_next() (pid=4082, ip=10.244.131.178, actor_id=7556bb2c4be6a3191cf35f9001000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f9397e02980>)
File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 118, in discard_return_wrapper
train_func(*args, **kwargs)
File "/opt/aframe/projects/train/train/tune/utils.py", line 175, in __call__
trainer.fit(cli.model, cli.datamodule)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 203, in run
self.on_advance_end()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 372, in on_advance_end
call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=False)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 208, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/train/lightning/_lightning_utils.py", line 270, in on_train_epoch_end
train.report(metrics=metrics, checkpoint=checkpoint)
File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 644, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 706, in report
_get_session().report(metrics, checkpoint=checkpoint)
File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 417, in report
persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 558, in persist_current_checkpoint
_pyarrow_fs_copy_files(
File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 110, in _pyarrow_fs_copy_files
return pyarrow.fs.copy_files(
File "/usr/local/lib/python3.10/site-packages/pyarrow/fs.py", line 272, in copy_files
_copy_files_selector(source_fs, source_sel,
File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When uploading part for key 'helm-asha-2/TorchTrainer_2024-02-12_15-34-06/TorchTrainer_2cf91_00128_128_data_mute_prob=0.0489,data_swap_prob=0.0378,data_waveform_prob=0.2826,model_learning_rate=0.0096,mode_2024-02-12_17-00-15/checkpoint_000021/checkpoint.ckpt' in bucket 'aframe-test': AWS Error ACCESS_DENIED during UploadPart operation:
Trials on other nodes are failing with a similar error:
OSError: When uploading part for key 'helm-asha-2/TorchTrainer_2024-02-12_15-34-06/TorchTrainer_2cf91_00194_194_data_mute_prob=0.1186,data_swap_prob=0.0385,data_waveform_prob=0.6407,model_learning_rate=0.0006,mode_2024-02-12_19-50-23/checkpoint_000010/checkpoint.ckpt' in bucket 'aframe-test': AWS Error NETWORK_CONNECTION during UploadPart operation: curlCode: 52, Server returned nothing (no headers, no data)
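For debugging, I was thinking of dispatching a minimal write check like the one below to the workers, to confirm each pod still has working credentials for the bucket (the object key here is made up for illustration):

```python
import os

import pyarrow.fs as pafs
import ray

ray.init()  # connect to the existing cluster

@ray.remote
def check_s3_write():
    # S3FileSystem picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
    # environment, the same way the checkpoint upload does.
    fs = pafs.S3FileSystem()
    key = "aframe-test/helm-asha-2/_credential_check"  # hypothetical test key
    with fs.open_output_stream(key) as f:
        f.write(b"ok")
    fs.delete_file(key)
    # Return a short prefix of the key ID so results can be compared per node.
    return os.environ.get("AWS_ACCESS_KEY_ID", "")[:4]

# Run the check on several workers and compare what each one sees.
print(ray.get([check_s3_write.remote() for _ in range(4)]))
```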