Ray cluster failed to launch with my custom image

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity.
  • Low: It annoys or frustrates me for a moment.
  • Medium: It causes significant difficulty in completing my task, but I can work around it.
  • High: It blocks me from completing my task.

High

Hi,

I’m trying to deploy a Ray cluster with a custom image based on one of the AWS Deep Learning Containers (deep-learning-containers/available_images.md at master · aws/deep-learning-containers · GitHub). However, the setup fails with an rsync error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Shared connection to 3.238.226.224 closed.
2023-03-08 21:49:17,080 WARNING command_runner.py:1057 -- Nvidia Container Runtime is present, but no GPUs found.
ae672b07f18b4cf65f917cb9f0d35c9d80a793eb805abbce70cc953e8a0a15ff
Shared connection to 3.238.226.224 closed.
Shared connection to 3.238.226.224 closed.
protocol version mismatch -- is your shell clean?
(see the rsync man page for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(178) [sender=3.1.2]
Shared connection to 3.238.226.224 closed.
  New status: update-failed
  !!!
  SSH command failed.
  !!!
  
  Failed to setup head node.

Any ideas?

Could you provide more information? I’m having trouble understanding the issue from the log alone. A detailed reproduction script would be greatly appreciated. Thanks!

Is rsync installed in your custom image?


@amiasato thanks! I hit the same error as the OP, and installing rsync into my custom image fixed it.
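In case it helps anyone else who lands here: the change amounts to one extra layer in the Dockerfile for the custom image. A minimal sketch, assuming a Debian/Ubuntu-based AWS Deep Learning Containers base image (the FROM tag below is a placeholder, pick the actual tag you use from available_images.md):

# Dockerfile for the custom image
# Base tag is a placeholder; choose one from aws/deep-learning-containers available_images.md
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:&lt;your-tag&gt;

# rsync is needed by the Ray cluster launcher to sync files into the container;
# without it, setup fails with "rsync error: protocol incompatibility"
RUN apt-get update && apt-get install -y --no-install-recommends rsync \
    && rm -rf /var/lib/apt/lists/*

After rebuilding and pushing the image, point the cluster config's docker image at the new tag and run ray up again; the file-sync step should then succeed.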