Hi! I am working on a pipeline for running Tune+RLlib training on AWS. The Ray cluster is managed by the Kubernetes ray-operator. For a job that trains a PPO agent, one ray-head pod and one ray-worker pod are created.
I used the following `sync_config`:
```python
sync = tune.SyncConfig(
    upload_dir=f"s3://my-bucket",
    sync_to_driver=False,
    sync_to_cloud="aws s3 sync {source} {target}",
    node_sync_period=10,
    cloud_sync_period=10,
)
```
and wrapped the trainable with `tune.durable`:
```python
tune.run(
    tune.durable("PPO"),
    # ...
    checkpoint_freq=1,
    checkpoint_at_end=True,
    keep_checkpoints_num=3,
    checkpoint_score_attr="episode_reward_mean",
    # ...
    sync_config=sync,
)
```
The training works, but the checkpoints aren’t synced: they are present on the worker pod, but not on the head node. In the logs I see a lot of errors like this one:
```
(pid=388) 2021-07-19 17:28:29,359 ERROR trial_runner.py:915 -- Trial PPO_WebEnv_7cb70_00000: Error handling checkpoint /root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/checkpoint_000001/checkpoint-1
(pid=388) Traceback (most recent call last):
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 907, in _process_trial_save
(pid=388)     self._callbacks.on_checkpoint(
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/callback.py", line 216, in on_checkpoint
(pid=388)     callback.on_checkpoint(**info)
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/syncer.py", line 455, in on_checkpoint
(pid=388)     self._sync_trial_checkpoint(trial, checkpoint)
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/syncer.py", line 428, in _sync_trial_checkpoint
(pid=388)     raise TuneError("Trial {}: Checkpoint path {} not "
(pid=388) ray.tune.error.TuneError: Trial PPO_WebEnv_7cb70_00000: Checkpoint path /root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/checkpoint_000001/checkpoint-1 not found after successful sync down.
```
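For reference, the checkpoint path from that error does exist on the worker pod. My check there was roughly equivalent to this sketch (the path is copied from the error above; this is just an illustration of what I verified, run inside the worker pod):

```python
# Rough sketch of the existence check run inside the worker pod.
# The path is copied verbatim from the TuneError above.
from pathlib import Path

ckpt = Path(
    "/root/ray_results/short/"
    "PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/"
    "checkpoint_000001/checkpoint-1"
)
print(ckpt.exists())  # True inside the worker pod
print([p.name for p in ckpt.parent.iterdir()])  # the checkpoint directory is populated there
```

Running the same check on the head node shows that the path does not exist there, which matches the error.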
I’ve used a custom `sync_to_cloud` template just to avoid passing the `--only-show-errors` flag to the `aws s3 sync` command. In the logs I also see info about syncing up:

```
upload: ../root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/events.out.tfevents.1626708470.ray-cluster-training-ray-head-type-z24cp to s3://alan-system-nonproduction/training/cb9a5922/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/events.out.tfevents.1626708470.ray-cluster-training-ray-head-type-z24cp
```
but there is no corresponding info about syncing anything down. There is also no information about checkpoints being synced up (only TensorBoard logs, experiment state, etc.), yet when I check the bucket with the AWS CLI, the checkpoints are there. I’ve also verified that all pods are capable of syncing with this bucket.
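For completeness, the per-pod bucket-access check I ran was roughly equivalent to the following sketch (the bucket name and prefix are placeholders, and boto3 here stands in for the aws CLI calls I actually used):

```python
# Rough per-pod S3 access check; bucket and prefix are placeholders, not the real values.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder

# Round-trip a tiny object to confirm the pod can write to the bucket.
s3.put_object(Bucket=bucket, Key="tune-sync-check/ping.txt", Body=b"ping")

# List what has actually been uploaded under the experiment prefix.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="short/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```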
I’ve tested this on Ray 1.4.1 and wheels from this commit: [tune] Pass custom `sync_to_cloud` templates to durable trainables (#… · ray-project/ray@4178655 · GitHub
Do you have any idea what might be wrong?