Ray cluster up on-premise

Does anybody have a working example of ray up with on-premise machines? I keep getting an SSH error, and I have no idea how to properly set up the YAML file. The documentation is very sparse on this too.

New status: update-failed
!!!
Exception details: {'message': 'SSH command failed.'}
Full traceback: Traceback (most recent call last):
File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
self.do_update()
File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 457, in do_update
self.cmd_runner.run_init(
File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 821, in run_init
dst=self._docker_expand_user(mount),
File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 635, in _docker_expand_user
self.ssh_command_runner.run(
File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 379, in run
return self._run_helper(
File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

Error message: SSH command failed.
!!!

Failed to setup head node.

Hi matters!

So, from my understanding, you’re encountering SSH errors while trying to set up a Ray cluster on on-premise machines? Here are some things that might help:

  1. SSH Configuration: Ensure that SSH is properly configured on all machines in your cluster. This includes:
  • SSH keys are correctly set up and distributed to all nodes.
  • The SSH user specified in your YAML configuration has the necessary permissions to access the nodes.
  • The SSH daemon (sshd) is running on all nodes.
  2. YAML Configuration: The YAML file is crucial for setting up the cluster. Are you sure your YAML file is valid? Try running it through a validator if you can; there's also a rough on-premise example sketched right after this list.
  3. Troubleshooting SSH Errors:
  • Verify that you can manually SSH into each node using the same credentials and keys specified in your YAML file.
  • Check the SSH logs on the nodes for any errors or warnings that might indicate what is going wrong.
  4. Testing with a Simple Setup: Start with a minimal setup, perhaps just the head node, and make sure that works before scaling up to include worker nodes.
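
For the YAML, here is a rough sketch of what a minimal on-premise config (provider type "local") can look like, based on Ray's local-provider setup. The cluster name, IPs, SSH user, key path, and the commented-out Docker image are all placeholders you would swap for your own values:

cat > cluster.yaml <<'EOF'
# Minimal sketch of an on-premise Ray cluster config (placeholder values throughout).
cluster_name: onprem-example

provider:
  type: local
  head_ip: 192.168.1.10                       # placeholder: head node IP
  worker_ips: [192.168.1.11, 192.168.1.12]    # placeholder: worker node IPs

auth:
  ssh_user: ubuntu                 # placeholder: user you can SSH in as
  ssh_private_key: ~/.ssh/id_rsa   # placeholder: key available where you run ray up

# Optional: only keep this if you want Ray to run inside Docker on the nodes;
# leaving it out entirely is the simplest way to rule out Docker-related issues.
# docker:
#   image: rayproject/ray:latest
#   container_name: ray_container

min_workers: 2
max_workers: 2

head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379
EOF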

I found some docs that might help here. Let me know if any of this helped!!


Hi @christina, thank you for the prompt reply!

Before posting, I verified that I could manually SSH into each node.
I am executing the ray up command inside a Docker container, and to use my key, I'm mapping the .ssh folder (with the key) into the container as a volume. However, the error is not very descriptive.

Do you know if this could be causing the error?

If I do it manually (with the ray start CLI) on each node, everything works.

Hi matters,
Yeah, I do think running ray up from inside the container could be causing issues. A few other things to try: since you're mapping .ssh into the container, check that the key permissions are correct (chmod 600 ~/.ssh/id_rsa) and that the container has access to your SSH agent (e.g. by mounting the socket with -v $SSH_AUTH_SOCK:/ssh-agent).
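
Roughly something like this on the host before launching the container. The image name is a placeholder, and mounting into /root/.ssh assumes the container user is root; adjust the path if yours isn't:

# tighten permissions on the key
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa

# launch the container with the .ssh folder mounted read-only and the
# host's ssh-agent socket forwarded (assumes an agent is running on the host)
docker run -it \
  -v ~/.ssh:/root/.ssh:ro \
  -v "$SSH_AUTH_SOCK":/ssh-agent \
  -e SSH_AUTH_SOCK=/ssh-agent \
  your-isaaclab-image bash    # placeholder image name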

Can you also try running your ray up command with the verbose flag to see if the error gives anything new? And maybe look into setting up the Ray debugger as well: Ray Distributed Debugger — Ray 2.42.1
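
For example, assuming cluster.yaml is your config file (the -v flag can be repeated for more detail):

# re-run the launcher with verbose output and auto-confirm prompts
ray up cluster.yaml -y -v

# in another shell, tail the autoscaler output for the cluster
ray monitor cluster.yaml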

In the meantime, you can just run ray start manually on each node as a temporary workaround, since you mentioned that works. That also helps isolate whether the issue is with the ray up command itself or with the SSH setup.

Hi @christina,

Thank you for the info.
The thing is, the SSH error appears after ray up successfully runs an nvidia-smi command on the remote node and shows the correct output from the remote machine. Sorry for not mentioning that earlier. Only after that does the error show up.

Shared connection to IP_ADDRESS closed.                                                    
a91f0e8a7e882fe344a36109b4c472e54e0307526d705c97ee73c1d3bf325685
Shared connection to IP_ADDRESS closed.
Shared connection to IP_ADDRESS closed.
2025-02-12 01:47:22,035 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['IP_ADDRESS']
  New status: update-failed
  !!!
  Full traceback: Traceback (most recent call last):
  File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
    self.do_update()
  File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 457, in do_update
    self.cmd_runner.run_init(
  File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 821, in run_init
    dst=self._docker_expand_user(mount),
  File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 635, in _docker_expand_user
    self.ssh_command_runner.run(
  File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
    return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
  File "/workspace/isaaclab/_isaac_sim/kit/python/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

  Error message: SSH command failed.
  !!!
   
  Failed to setup head node.
There was an error running python

Maybe I'll skip the Ray cluster launcher and just write an sh file to start the resources I need (which are only 5), but thanks anyway!

Hmmm… okay… So after some debugging/googling: maybe the remote machine's SSH server is set to close idle sessions quickly (check /etc/ssh/sshd_config for ClientAliveInterval and ClientAliveCountMax)? Or maybe the SSH session opened from inside the Docker container gets interrupted by network or container issues.
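
A quick way to inspect that on a remote node (the right values depend entirely on your environment, this just shows what is currently set):

# keepalive-related settings in the config file
grep -Ei 'clientaliveinterval|clientalivecountmax' /etc/ssh/sshd_config

# or ask sshd for its effective configuration (usually needs root)
sudo sshd -T | grep -i clientalive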

If your Ray cluster YAML specifies docker options for running commands inside a container on the worker nodes, check whether it’s causing conflicts. Does running without Docker in the YAML work?

The log mentions ClusterState: Writing cluster state, which might indicate a failure in storing metadata. Make sure your user inside the container has write permissions to wherever Ray is saving state files.

But yeah, if you're only managing 5 nodes, manually starting Ray with a simple script (ray start --head on the main node and ray start --address=HEAD_NODE_IP:6379 on the others) might be a good workaround; a rough sketch is below. Let me know what ends up working :'D
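
The IPs and user in this sketch are placeholders, and it assumes passwordless SSH to each node with Ray already installed there:

#!/usr/bin/env bash
# Sketch: start a small Ray cluster without the cluster launcher.
# Replace HEAD_IP, WORKER_IPS, and SSH_USER with your own machines.
HEAD_IP=192.168.1.10
WORKER_IPS=(192.168.1.11 192.168.1.12 192.168.1.13 192.168.1.14)
SSH_USER=ubuntu

# start the head node
ssh "$SSH_USER@$HEAD_IP" "ray stop; ray start --head --port=6379"

# attach each worker to the head
for ip in "${WORKER_IPS[@]}"; do
  ssh "$SSH_USER@$ip" "ray stop; ray start --address=$HEAD_IP:6379"
done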