Sync down not happening when using cloud checkpointing

Hi! I am working on a pipeline for running Tune+RLlib training on AWS. The Ray cluster is managed by the Kubernetes ray-operator. For a job training a PPO agent, one ray-head pod and one ray-worker pod are created.

I used the following SyncConfig:

sync = tune.SyncConfig(
    upload_dir="s3://my-bucket",
    sync_to_driver=False,  # don't sync trial directories to the driver node
    sync_to_cloud="aws s3 sync {source} {target}",  # custom sync command template
    node_sync_period=10,
    cloud_sync_period=10,  # sync to cloud every 10 seconds
)

and wrapped the trainable with tune.durable:

tune.run(
    tune.durable("PPO"),
    # ...
    checkpoint_freq=1,
    checkpoint_at_end=True,
    keep_checkpoints_num=3,
    checkpoint_score_attr="episode_reward_mean",
    # ...
    sync_config=sync,
)

The training works, but the checkpoints aren’t synced: they are present on the worker pod, but not on the head node. In the logs I’ve observed many errors like this one:

(pid=388) 2021-07-19 17:28:29,359	ERROR trial_runner.py:915 -- Trial PPO_WebEnv_7cb70_00000: Error handling checkpoint /root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/checkpoint_000001/checkpoint-1
(pid=388) Traceback (most recent call last):
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 907, in _process_trial_save
(pid=388)     self._callbacks.on_checkpoint(
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/callback.py", line 216, in on_checkpoint
(pid=388)     callback.on_checkpoint(**info)
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/syncer.py", line 455, in on_checkpoint
(pid=388)     self._sync_trial_checkpoint(trial, checkpoint)
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/syncer.py", line 428, in _sync_trial_checkpoint
(pid=388)     raise TuneError("Trial {}: Checkpoint path {} not "
(pid=388) ray.tune.error.TuneError: Trial PPO_WebEnv_7cb70_00000: Checkpoint path /root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/checkpoint_000001/checkpoint-1 not found after successful sync down.

I’ve used a custom sync_to_cloud template just to avoid passing the --only-show-errors flag to the aws s3 sync command. In the logs I also see info about syncing up:

upload: ../root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/events.out.tfevents.1626708470.ray-cluster-training-ray-head-type-z24cp to s3://alan-system-nonproduction/training/cb9a5922/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/events.out.tfevents.1626708470.ray-cluster-training-ray-head-type-z24cp

but no info about syncing down. There is also no information about syncing up checkpoints (just TensorBoard logs, experiment state, etc.). At the same time, when I check with the AWS CLI, the checkpoints are present in the bucket. I’ve also verified that all pods are able to sync with this bucket.

I’ve tested this on Ray 1.4.1 and on wheels from this commit: “[tune] Pass custom `sync_to_cloud` templates to durable trainables (#…)” (ray-project/ray@4178655).

Do you have any idea what might be wrong?

Can you try setting sync_on_checkpoint=False in the SyncConfig?
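
E.g. your SyncConfig from above with only sync_on_checkpoint added (the rest stays as-is):

from ray import tune

sync = tune.SyncConfig(
    upload_dir="s3://my-bucket",
    sync_to_driver=False,
    sync_to_cloud="aws s3 sync {source} {target}",
    sync_on_checkpoint=False,  # don't force a sync-down of each checkpoint to the driver
    node_sync_period=10,
    cloud_sync_period=10,
)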

Using a durable trainable should not require syncing to the head node/pod at all, since all checkpoints should live on cloud storage. But I saw this error earlier today, so if it persists, I’m happy to look more into it.

Hey @kai. Indeed, with sync_on_checkpoint=False the error disappeared, but the checkpoints are still not present on the head pod. Is it expected behavior for Tune that checkpoints are stored only on one pod and in S3, but not synced down to the other pods?

root@ray-cluster-training-ray-worker-type-c5lxd:~/ray_results/short/PPO_WebEnv_7c88c_00000_0_max_nodes=150,visited_states=True_2021-07-20_17-02-22# ls -l
total 0
drwxr-xr-x 2 root root 82 Jul 20 17:05 checkpoint_000001
drwxr-xr-x 2 root root 82 Jul 20 17:06 checkpoint_000002

root@ray-cluster-training-ray-head-type-ds9jz:~/ray_results/short/PPO_WebEnv_7c88c_00000_0_max_nodes=150,visited_states=True_2021-07-20_17-02-22# ls -l
total 84
-rw-r--r-- 1 root root 21700 Jul 20 17:06 events.out.tfevents.1626793495.ray-cluster-training-ray-head-type-ds9jz
-rw-r--r-- 1 root root  5268 Jul 20 17:04 params.json
-rw-r--r-- 1 root root  4524 Jul 20 17:04 params.pkl
-rw-r--r-- 1 root root 11897 Jul 20 17:06 progress.csv
-rw-r--r-- 1 root root 31000 Jul 20 17:06 result.json

@kai did you have a chance to take a look at this?

Sorry for the delay. When using a durable trainable, checkpoints are only stored on the nodes that executed the trials and on S3.

Syncing to the driver is currently not supported for durable trainables - this is mostly because we could end up with data inconsistencies (we don’t want to continuously sync new checkpoints).

So if you want to have access to the checkpoints locally after training, you’d have to fetch them from S3 manually.
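
For example, something like this (just a sketch; it assumes the aws CLI is configured on the machine where you want the checkpoints, and the bucket/prefix and local path are placeholders you’d replace with your own):

import subprocess

# Placeholders -- point these at your actual upload_dir/experiment and a local target directory.
experiment_s3_uri = "s3://my-bucket/short"
local_dir = "/root/ray_results/short"

# Same `aws s3 sync {source} {target}` command as the sync-up template,
# just in the other direction (S3 -> local).
subprocess.run(["aws", "s3", "sync", experiment_s3_uri, local_dir], check=True)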

Thanks for the answer.

Still, there is one thing that bothers me: in case of a pod failure (e.g. when using AWS Spot instances), how does Tune recover failed workers? Will it spawn new workers and sync the checkpoints there?

Yes, this is exactly what should happen!
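
A rough sketch of what that looks like in the tune.run call (max_failures controls how many times Tune retries a failed trial, restoring from the latest checkpoint; the value here is just an example):

from ray import tune

tune.run(
    tune.durable("PPO"),
    checkpoint_freq=1,  # recent checkpoints end up on cloud storage via the durable trainable
    max_failures=3,     # retry a failed trial up to 3 times, restoring from the latest checkpoint
    sync_config=sync,   # the SyncConfig from above
)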