Sync down not happening when using cloud checkpointing

Hi! I am working on a pipeline for running Tune+RLlib training on AWS. The Ray cluster is managed by the Kubernetes ray-operator. For a job training a PPO agent, one ray-head pod and one ray-worker pod are created.

I used the following SyncConfig:

sync = tune.SyncConfig(
    upload_dir="s3://my-bucket",
    sync_to_driver=False,  # don't sync trial directories to the driver node
    sync_to_cloud="aws s3 sync {source} {target}",  # custom sync command template
    node_sync_period=10,
    cloud_sync_period=10,  # sync to cloud every 10 seconds
)

and wrapped the trainable with tune.durable:

tune.run(
    tune.durable("PPO"),
    # ...
    checkpoint_freq=1,
    checkpoint_at_end=True,
    keep_checkpoints_num=3,
    checkpoint_score_attr="episode_reward_mean",
    # ...
    sync_config=sync,
)

The training works, but the checkpoints aren’t synced: they are present on the worker pod, but not on the head node. In the logs I’ve observed many errors like this one:

(pid=388) 2021-07-19 17:28:29,359	ERROR trial_runner.py:915 -- Trial PPO_WebEnv_7cb70_00000: Error handling checkpoint /root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/checkpoint_000001/checkpoint-1
(pid=388) Traceback (most recent call last):
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 907, in _process_trial_save
(pid=388)     self._callbacks.on_checkpoint(
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/callback.py", line 216, in on_checkpoint
(pid=388)     callback.on_checkpoint(**info)
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/syncer.py", line 455, in on_checkpoint
(pid=388)     self._sync_trial_checkpoint(trial, checkpoint)
(pid=388)   File "/root/.pyenv/versions/3.8.6/lib/python3.8/site-packages/ray/tune/syncer.py", line 428, in _sync_trial_checkpoint
(pid=388)     raise TuneError("Trial {}: Checkpoint path {} not "
(pid=388) ray.tune.error.TuneError: Trial PPO_WebEnv_7cb70_00000: Checkpoint path /root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/checkpoint_000001/checkpoint-1 not found after successful sync down.

I’ve used a custom sync_to_cloud template just to avoid passing the --only-show-errors flag to the aws s3 sync command. In the logs I also see info about syncing up:

upload: ../root/ray_results/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/events.out.tfevents.1626708470.ray-cluster-training-ray-head-type-z24cp to s3://alan-system-nonproduction/training/cb9a5922/short/PPO_WebEnv_7cb70_00000_0_max_nodes=150,visited_states=True_2021-07-19_17-25-02/events.out.tfevents.1626708470.ray-cluster-training-ray-head-type-z24cp

but no info about syncing down. There is also no information about syncing up checkpoints (just TensorBoard logs, experiment state, etc.). At the same time, when I check with the AWS CLI, the checkpoints are present in the bucket. I’ve also verified that all pods are able to sync with this bucket.

I’ve tested this on Ray 1.4.1 and on wheels from this commit: “[tune] Pass custom `sync_to_cloud` templates to durable trainables (#…)” (ray-project/ray@4178655).

Do you have any idea what might be wrong?

Can you try setting sync_on_checkpoint=False in the SyncConfig?
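
E.g. your SyncConfig from above with only sync_on_checkpoint added (the rest stays as-is):

from ray import tune

sync = tune.SyncConfig(
    upload_dir="s3://my-bucket",
    sync_to_driver=False,
    sync_to_cloud="aws s3 sync {source} {target}",
    sync_on_checkpoint=False,  # don't force a sync-down of each checkpoint to the driver
    node_sync_period=10,
    cloud_sync_period=10,
)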

Using a durable trainable should not require syncing to the head node/pod at all, since all checkpoints should live on cloud storage. But I saw this error earlier today, so if it persists, I’m happy to look more into it.

Hey @kai. Indeed, with sync_on_checkpoint=False the error disappeared, but the checkpoints are still not present on the head pod. Is it expected behavior for Tune that checkpoints are stored only on one pod and in S3, but not synced down to the other pods?

root@ray-cluster-training-ray-worker-type-c5lxd:~/ray_results/short/PPO_WebEnv_7c88c_00000_0_max_nodes=150,visited_states=True_2021-07-20_17-02-22# ls -l
total 0
drwxr-xr-x 2 root root 82 Jul 20 17:05 checkpoint_000001
drwxr-xr-x 2 root root 82 Jul 20 17:06 checkpoint_000002

root@ray-cluster-training-ray-head-type-ds9jz:~/ray_results/short/PPO_WebEnv_7c88c_00000_0_max_nodes=150,visited_states=True_2021-07-20_17-02-22# ls -l
total 84
-rw-r--r-- 1 root root 21700 Jul 20 17:06 events.out.tfevents.1626793495.ray-cluster-training-ray-head-type-ds9jz
-rw-r--r-- 1 root root  5268 Jul 20 17:04 params.json
-rw-r--r-- 1 root root  4524 Jul 20 17:04 params.pkl
-rw-r--r-- 1 root root 11897 Jul 20 17:06 progress.csv
-rw-r--r-- 1 root root 31000 Jul 20 17:06 result.json

@kai did you have a chance to take a look at this?

Sorry for the delay. When using a durable trainable, checkpoints are only stored on the nodes that executed the trials and on S3.

Syncing to the driver is currently not supported for durable trainables - this is mostly because we could end up with data inconsistencies (we don’t want to continuously sync new checkpoints).

So if you want to have access to the checkpoints locally after training, you’d have to fetch them from S3 manually.
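
For example, something like this (just a sketch; it assumes the aws CLI is configured on the machine where you want the checkpoints, and the bucket/prefix and local path are placeholders you’d replace with your own):

import subprocess

# Placeholders -- point these at your actual upload_dir/experiment and a local target directory.
experiment_s3_uri = "s3://my-bucket/short"
local_dir = "/root/ray_results/short"

# Same `aws s3 sync {source} {target}` command as the sync-up template,
# just in the other direction (S3 -> local).
subprocess.run(["aws", "s3", "sync", experiment_s3_uri, local_dir], check=True)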

Thanks for the answer.

Still, there is one thing that bothers me: in case of a pod failure (e.g. when using AWS Spot instances), how does Tune recover failed workers? Will it spawn new workers and sync the checkpoints there?

Yes, this is exactly what should happen!
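
A rough sketch of what that looks like in the tune.run call (max_failures controls how many times Tune retries a failed trial, restoring from the latest checkpoint; the value here is just an example):

from ray import tune

tune.run(
    tune.durable("PPO"),
    checkpoint_freq=1,  # recent checkpoints end up on cloud storage via the durable trainable
    max_failures=3,     # retry a failed trial up to 3 times, restoring from the latest checkpoint
    sync_config=sync,   # the SyncConfig from above
)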