I’m seeing a strange error when using Ray Tune:
2020-11-16 23:56:48,002 ERROR trial_runner.py:868 -- Trial WrappedDistributedTorchTrainable_57085_00001: Error handling checkpoint /root/ray_results/WrappedDistributedTorchTrainable_2020-11-16_23-56-27/WrappedDistributedTorchTrainable_57085_00001_1_2020-11-16_23-56-38/checkpoint_7/./
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 864, in _process_trial_save
    trial.on_checkpoint(trial.saving_to)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py", line 493, in on_checkpoint
    raise e
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/trial.py", line 479, in on_checkpoint
    self.result_logger.wait()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/logger.py", line 378, in wait
    self._log_syncer.wait()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 193, in wait
    self.sync_client.wait()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/tune/sync_client.py", line 208, in wait
    args, code, error_msg))
ray.tune.error.TuneError: Sync error. Ran command: rsync -savz -e 'ssh -i /root/ray_bootstrap_key.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no' root@172.31.217.164:/root/ray_results/WrappedDistributedTorchTrainable_2020-11-16_23-56-27/WrappedDistributedTorchTrainable_57085_00001_1_2020-11-16_23-56-38/ /root/ray_results/WrappedDistributedTorchTrainable_2020-11-16_23-56-27/WrappedDistributedTorchTrainable_57085_00001_1_2020-11-16_23-56-38/
Error message (2): protocol version mismatch -- is your shell clean?
(see the rsync man page for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(178) [Receiver=3.1.2]
I am running inside Docker. To check for mismatched rsync versions across nodes, I ran the following, but it reported only a single rsync version across multiple nodes:
import subprocess

import ray

ray.init()
print(ray.available_resources())

@ray.remote
def func():
    return subprocess.check_output(["rsync", "--version"])

print(set(ray.get([func.remote() for x in range(100)])))
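Since the error complains about a protocol incompatibility rather than a missing rsync, it might also help to compare the release and protocol numbers explicitly instead of comparing raw banners. Below is a minimal sketch; the `parse_rsync_version` helper is my own, and it assumes the usual `rsync --version` banner format (e.g. `rsync  version 3.1.2  protocol version 31`):

```python
import re

def parse_rsync_version(output: bytes):
    """Extract (release, protocol) from `rsync --version` output.

    Assumes the common banner format, e.g.
    b"rsync  version 3.1.2  protocol version 31\\n...".
    """
    match = re.search(
        rb"rsync\s+version\s+([\d.]+)\s+protocol version\s+(\d+)", output
    )
    if match is None:
        raise ValueError("unrecognized rsync --version output")
    return match.group(1).decode(), int(match.group(2))

# Example with a captured banner; on a cluster you would feed in the bytes
# returned by the remote func() above from each node instead.
sample = b"rsync  version 3.1.2  protocol version 31\n"
print(parse_rsync_version(sample))  # -> ('3.1.2', 31)
```

Separately, the error message itself points at the rsync man page, which attributes this failure to a remote shell that emits extra output on login, so it may be worth checking that the shell on the remote node is "clean" in that sense as well.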