Rsync Error when using Ray Tune

I’m seeing a strange error when I am using Ray Tune.

2020-11-16 23:56:48,002 ERROR trial_runner.py:868 -- Trial WrappedDistributedTorchTrainable_57085_00001: Error handling checkpoint /root/ray_results/WrappedDistributedTorchTrainable_2020-11-16_23-56-27/WrappedDistributedTorchTrainable_57085_00001_1_2020-11-16_23-56-38/checkpoint_7/./
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 864, in _process_trial_save
    trial.on_checkpoint(trial.saving_to)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py", line 493, in on_checkpoint
    raise e
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py", line 479, in on_checkpoint
    self.result_logger.wait()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/logger.py", line 378, in wait
    self._log_syncer.wait()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 193, in wait
    self.sync_client.wait()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 208, in wait
    args, code, error_msg))
ray.tune.error.TuneError: Sync error. Ran command: rsync  -savz -e 'ssh -i /root/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no' root@172.31.217.164:/root/ray_results/WrappedDistributedTorchTrainable_2020-11-16_23-56-27/WrappedDistributedTorchTrainable_57085_00001_1_2020-11-16_23-56-38/ /root/ray_results/WrappedDistributedTorchTrainable_2020-11-16_23-56-27/WrappedDistributedTorchTrainable_57085_00001_1_2020-11-16_23-56-38/
Error message (2): protocol version mismatch -- is your shell clean?
(see the rsync man page for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(178) [Receiver=3.1.2]
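
For reference, the rsync man page attributes this particular message to unexpected output injected by the remote shell before rsync starts. A quick sanity check (just a sketch; the host and key path are copied from the rsync command above and stand in for your own cluster):

import subprocess

# Per the rsync man page, a non-interactive SSH command should produce no
# extra output; anything else can break rsync's protocol handshake.
out = subprocess.check_output([
    "ssh",
    "-i", "/root/ray_bootstrap_key.pem",
    "-o", "StrictHostKeyChecking=no",
    "root@172.31.217.164",
    "/bin/true",
])
print("shell looks clean" if out == b"" else "unexpected output: %r" % out)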

I am using Docker, but when I ran the following snippet to check the rsync version across nodes, I only got back a single version:

import subprocess

import ray

ray.init()
print(ray.available_resources())


@ray.remote
def func():
    # Return the rsync version string from whichever node this task lands on.
    return subprocess.check_output(["rsync", "--version"])


# Run enough tasks to touch every node; a single distinct output means all
# nodes report the same rsync version.
print(set(ray.get([func.remote() for x in range(100)])))

I just filed a PR that should fix this issue: https://github.com/ray-project/ray/pull/12108

Just for documentation purposes: we recently switched to Docker as the default environment for running Ray on autoscaled nodes. Ray Tune uses rsync to synchronize logs and checkpoints between nodes, but it needs a specific syncer, the DockerSyncer, to work with the autoscaler's Docker containers. The same is true for Kubernetes, by the way, which we also addressed in this PR.

With the PR, a Docker/Kubernetes autoscaler environment is detected automatically and the correct syncer is used.
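
On Ray versions without that automatic detection, the syncer can also be set explicitly. This is only a sketch, assuming the DockerSyncer from ray.tune.integration.docker and the sync_to_driver field of tune.SyncConfig as they existed around Ray 1.x; check the docs for your version:

from ray import tune
from ray.tune.integration.docker import DockerSyncer


def train_fn(config):
    # Placeholder trainable; in the report above this role is played by the
    # WrappedDistributedTorchTrainable.
    tune.report(metric=0.0)


tune.run(
    train_fn,
    sync_config=tune.SyncConfig(sync_to_driver=DockerSyncer),
)

For Kubernetes, the analogous NamespacedKubernetesSyncer from ray.tune.integration.kubernetes takes the namespace as an argument.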

Thanks Kai! Appreciate it!

Many thanks for this! Just ran into the same issue.


There still seems to be an issue here with sync_up when using the DockerSyncer:

invalid output path: directory "/tmp/ray_tmp_mount/...

Interestingly, earlier in the logs I can see the directory being created:

VINFO command_runner.py:474 -- Running `mkdir -p /tmp/ray_tmp_mount/...

I’ve left a comment on an issue with a similar offending directory: https://github.com/ray-project/ray/issues/12104.