Rsync Error when using Ray Tune

I’m seeing a strange error when I am using Ray Tune.

2020-11-16 23:56:48,002 ERROR -- Trial WrappedDistributedTorchTrainable_57085_00001: Error handling checkpoint /root/ray_results/WrappedDistributedTorchTrainable_2020-11-16_23-56-27/WrappedDistributedTorchTrainable_57085_00001_1_2020-11-16_23-56-38/checkpoint_7/./
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/", line 864, in _process_trial_save
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/", line 493, in on_checkpoint
    raise e
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/", line 479, in on_checkpoint
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/", line 378, in wait
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/", line 193, in wait
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/", line 208, in wait
    args, code, error_msg))
ray.tune.error.TuneError: Sync error. Ran command: rsync  -savz -e 'ssh -i /root/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no' root@ /root/ray_results/WrappedDistributedTorchTrainable_2020-11-16_23-56-27/WrappedDistributedTorchTrainable_57085_00001_1_2020-11-16_23-56-38/
Error message (2): protocol version mismatch -- is your shell clean?
(see the rsync man page for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(178) [Receiver=3.1.2]

I am using Docker, but when I ran the following across multiple nodes, it reported only a single rsync version:

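import subprocess

import ray

# Connect to the already-running cluster (assumes this runs on a cluster node).
ray.init(address="auto")

@ray.remote  # required so that func.remote() can be scheduled as a Ray task
def func():
    # Report the rsync version on whichever node executes this task.
    return subprocess.check_output(["rsync", "--version"])

# A set with a single element means every sampled node reports the same version.
print(set(ray.get([func.remote() for _ in range(100)])))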

I just filed a PR that should fix this issue:

Just for documentation purposes: We recently switched to Docker as the default environment for running Ray on autoscaled nodes. Ray Tune uses rsync to synchronize logs and checkpoints between nodes, but it needs a specific syncer, the DockerSyncer, to work with autoscaler Docker containers. The same is true for Kubernetes, by the way, which we also addressed in this PR.

With the PR, a Docker/Kubernetes autoscaler environment is detected automatically and the correct syncer is passed.
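
To illustrate the idea of picking a syncer from the environment, here is a minimal sketch. It is not the code from the PR: the helper name `pick_syncer_name`, the returned labels, and the assumption that the decision can be made from the `docker` and `provider.type` entries of an autoscaler cluster config are all hypothetical.

```python
from typing import Any, Dict


def pick_syncer_name(cluster_config: Dict[str, Any]) -> str:
    """Hypothetical helper: choose a syncer label from a cluster config dict."""
    # Assumption: a non-empty "docker" section means Ray runs inside containers,
    # so plain host-to-host rsync would not reach the container's filesystem.
    if cluster_config.get("docker"):
        return "DockerSyncer"
    # Assumption: the provider type identifies a Kubernetes-backed cluster.
    if cluster_config.get("provider", {}).get("type") == "kubernetes":
        return "KubernetesSyncer"
    # Otherwise fall back to the default rsync-over-SSH behaviour.
    return "default"


if __name__ == "__main__":
    # Illustrative configs only; the image and container names are made up.
    print(pick_syncer_name({"docker": {"image": "example/image", "container_name": "ray"}}))
    print(pick_syncer_name({"provider": {"type": "kubernetes"}}))
    print(pick_syncer_name({"provider": {"type": "aws"}}))
```

The returned strings are only labels for this sketch; in a real setup the corresponding syncer class would be handed to Tune instead.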

Thanks Kai! Appreciate it!

Many thanks for this! Just ran into the same issue.


There still seems to be an issue here on sync_up when using DockerSyncer:

invalid output path: directory "/tmp/ray_tmp_mount/...

Interestingly earlier in the logs I see the directory being created:

VINFO -- Running `mkdir -p /tmp/ray_tmp_mount/...

I’ve commented on an issue that involves a similar offending directory: