Ray up on a local provider cluster only starts the head node

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.47.1
  • Python version: 3.12.7
  • OS: Ubuntu 24.04

3. What happened vs. what you expected:

  • Expected: When starting the cluster with the config below, I expect Ray to SSH into each host (head and workers), start the Docker container, and execute the defined setup commands. I have also tried adding min_workers=2 and max_workers=2 (see the sketch after the config below).

  • Actual:
    Only the head node is initialized and set up. The worker nodes listed in worker_ips are never contacted or started, and the terminal hangs after the head setup.

config.yaml:

cluster_name: rag-cluster
provider:
  type: local
  head_ip: 172.16.20.3
  worker_ips: [172.16.20.1, 172.16.20.2]  # Mandatory but does not automatically start the worker nodes for a local cluster

docker:
  image: .../myimage
  pull_before_run: true
  container_name: ray_node
  run_options:
    - --gpus all
    - -v /ray_mount/model_weights:/app/model_weights
    - -v /ray_mount/data:/app/data
    - -v /ray_mount/db:/app/db
    - -v /ray_mount/.hydra_config:/app/.hydra_config
    - -v /ray_mount/logs:/app/logs
    - --env-file /ray_mount/.env

auth:
  ssh_user: root
  ssh_private_key: lucie_chat_id_rsa

head_setup_commands:
  - bash /app/ray-cluster/start_head.sh
worker_setup_commands:
  - bash /app/ray-cluster/start_worker.sh
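
For reference, the min_workers / max_workers variant I mentioned above was added at the top level of the same file, roughly like this (a sketch only; the rest of the config stays unchanged, and if I understand the docs correctly the local provider caps the worker count at the number of worker_ips anyway):

# Sketch of the min_workers/max_workers variant I also tried; these are
# top-level fields of the cluster config, alongside provider/docker/auth.
cluster_name: rag-cluster
min_workers: 2
max_workers: 2

provider:
  type: local
  head_ip: 172.16.20.3
  worker_ips: [172.16.20.1, 172.16.20.2]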

Logs Summary:

  • ClusterState correctly shows: ['172.16.20.3', '172.16.20.2']
  • Head node is launched, Docker image is pulled, setup commands run successfully
  • Worker nodes are not triggered at all
  • Terminal ends with Ray startup instructions for the head, no indication that workers are being set up

Full logs:
Cluster: rag-cluster

Checking Local environment settings
2025-07-22 17:40:29,616 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['172.16.20.3', '172.16.20.2']
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Usage stats collection is enabled. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster.

Acquiring an up-to-date head node
2025-07-22 17:40:29,616 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['172.16.20.3', '172.16.20.2']
Launched a new head node
Fetching the new head node

<1/1> Setting up head node
Prepared bootstrap config
2025-07-22 17:40:29,618 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['172.16.20.3', '172.16.20.2']
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running uptime as a test.
Fetched IP: 172.16.20.3
Warning: Permanently added '172.16.20.3' (ED25519) to the list of known hosts.
17:40:30 up 9:35, 2 users, load average: 0.20, 0.16, 0.11
Shared connection to 172.16.20.3 closed.
Success.
Updating cluster configuration. [hash=f8946df353933904aff5650b7fb3522bb7c132d7]
2025-07-22 17:40:30,141 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['172.16.20.3', '172.16.20.2']
New status: syncing-files
[2/7] Processing file mounts
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
[3/7] No worker file mounts to sync
2025-07-22 17:40:30,436 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['172.16.20.3', '172.16.20.2']
New status: setting-up
[4/7] No initialization commands to run.
[5/7] Initializing command runner
Shared connection to 172.16.20.3 closed.
Using default tag: latest
latest: Pulling from linagora/openrag-ray
Digest: sha256:cfbc0c67a16a6cd706afd011f7107b545b274da63050054cbcf403300658805c
Status: Image is up to date for …/myimage
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
Tue Jul 22 17:40:31 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:01:00.0 Off |                    0 |
| N/A   40C    P8             16W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Shared connection to 172.16.20.3 closed.
7223a7d6027a0b182cf600da01cf7292abceee8d36daf9e92a39460f68a4c698
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
sending incremental file list
ray_bootstrap_config.yaml

sent 780 bytes received 35 bytes 1,630.00 bytes/sec
total size is 1,348 speedup is 1.65
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
sending incremental file list
ray_bootstrap_key.pem

sent 2,121 bytes received 35 bytes 4,312.00 bytes/sec
total size is 2,622 speedup is 1.22
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
[6/7] Running setup commands
(0/1) bash /app/ray-cluster/start_he…
Enable usage stats collection? This prompt will auto-proceed in 10 seconds to avoid blocking cluster startup. Confirm [Y/n]: Y
Usage stats collection is enabled. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster.

Local node IP: 172.16.20.3


Ray runtime started.

Next steps
To add another node to this Ray cluster, run
ray start --address='172.16.20.3:6379'

To connect to this Ray cluster:
import ray
ray.init(_node_ip_address='172.16.20.3')

To submit a Ray job using the Ray Jobs CLI:
RAY_ADDRESS='http://172.16.20.3:8265' ray job submit --working-dir . -- python my_script.py

for more information on submitting Ray jobs to the Ray cluster.

To terminate the Ray runtime, run
ray stop

To view the status of the cluster, use
ray status

To monitor and debug Ray, view the dashboard at
172.16.20.3:8265

If connection to the dashboard fails, check your firewall settings and network configuration.

  • Terminal remains idle here after head setup; no activity or errors

Hi, do you know the SSH configuration of your head node? Is it able to SSH with the credentials that you've provided (root and lucie_chat_id_rsa)? Do they have the correct permissions?

Also, there are no networking issues with the head node either, right?

Do you know what the configs look like for your workers? For example, do they have worker_ips configured in the YAML file?
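
If it helps, this is roughly the kind of check I'd run from the machine where you invoke ray up (a sketch only, using the user/key/IPs from your config; note that ray up also expects rsync and docker to be available on every node):

# Hypothetical sanity check; adjust the key path if it isn't next to your config.
chmod 600 lucie_chat_id_rsa                     # private keys must be owner-only
for ip in 172.16.20.3 172.16.20.1 172.16.20.2; do
  ssh -i lucie_chat_id_rsa -o BatchMode=yes root@"$ip" 'hostname && which rsync && docker --version'
done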

Hi, thanks for the follow-up.

Yes — the SSH configuration is working correctly and it is the same for all nodes (head and workers). The head node is able to SSH into all worker nodes using the provided credentials (root and lucie_chat_id_rsa). The private key has the correct permissions, and I’ve manually verified SSH access from the head node to each worker.

There are no networking issues either — all nodes are on the same LAN, can ping each other, and SSH works both ways.

As for the config, the worker_ips are explicitly defined in the YAML under provider.worker_ips, like this:

provider:
  type: local
  head_ip: 172.16.20.3
  worker_ips: [172.16.20.1, 172.16.20.2]

You can find the full YAML in my post.

Yet when I run ray up, only the head node is provisioned; the workers are never contacted. Let me know if there's any extra flag or manual step required in local mode to trigger worker setup.

Thanks!

Oh, I see. If you're running in local mode, you might need some extra steps to trigger worker setup.

After reading some docs, it seems that in local mode Ray does not automatically start the worker nodes; you need to start Ray on each worker yourself. You can do this by SSHing into each worker node and running ray start --address='head_node_ip:port', where head_node_ip is the IP address of your head node and port is the port Ray is listening on (default is 6379), roughly like the sketch below.
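
A rough sketch of what that could look like for your setup (hypothetical, not exactly what the cluster launcher would run; since your nodes run Ray inside Docker, the ray start would need to happen inside a container started from the same image):

# On each worker (172.16.20.1 and 172.16.20.2), after SSHing in as root:
# either directly on the host, if Ray is installed there...
ray start --address='172.16.20.3:6379'

# ...or inside a container from your image, reusing the run options from your config:
docker run -d --gpus all --env-file /ray_mount/.env .../myimage \
    ray start --address='172.16.20.3:6379' --block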

Let me know if this was helpful. I reread your post and I don't think you SSH'd into the workers, but let me know if you tried this out and you're still not seeing the workers!

You can read more here: ray/doc/source/cluster/vms/user-guides/launching-clusters/on-premises.md at releases/2.47.1 · ray-project/ray · GitHub

I have been trying to get the Ray “local” provider to work for a day now, and it really, really doesn’t want to cooperate. I want to launch my Ray cluster with “ray up” and have it start the head and worker nodes through Docker. I’m following the documentation almost exactly, so I’m not doing anything unusual. All my machines have identical operating systems, identical hardware, identical SSH keys… and they all have network and SSH access to each other. They all have rsync, Docker, etc. installed. In my opinion, it should work. I did all the right things.

And yet, when I start the cluster, only the head node comes up — none of the workers do. I can, of course, launch the workers manually with ray start on each worker, but that defeats the purpose of the “local” provider over SSH. I’m running all of this on Debian 13.0 machines (fresh install) with Ray 2.49.1. Someone should go over this documentation and see if it actually works, because for me, it doesn’t:
https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/on-premises.html#on-prem

Kind regards,
Alexander

Hi @adhoore! :frowning: Sorry to hear about that; feel free to reach out to me on Slack if you're still having issues.

There's actually a long discussion about this on GitHub: [ray local cluster] nodes marked as uninitialized · Issue #39565 · ray-project/ray · GitHub, although it's supposedly been fixed; see also [Ray Cluster] Ray cluster stuck after using ray up · Issue #38718 · ray-project/ray · GitHub.

Can you let me know what step in the guide it's failing on? I'll try to rerun it myself locally and debug it. Also, are there any logs that indicate why the workers are failing to start?

It's not really "failing" on any of the steps from the documentation. Everything looks fine; it's just that, at the end, the workers are not started. There are no errors or warnings or anything.

My logs look almost exactly the same as the monitor.log posted by user "jmakov" in this issue: [ray local cluster] nodes marked as uninitialized · Issue #39565 · ray-project/ray · GitHub

They end with:

2023-09-22 21:37:17,619 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.106.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.107.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.108.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.110.

And then… nothing.
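
For completeness, this is roughly how I'm pulling those logs on the head node (a sketch; ray_node is the docker.container_name from the config earlier in this thread, so substitute your own, and /tmp/ray/session_latest is Ray's default session directory):

# On the head node, inside the Ray container:
docker exec -it ray_node tail -n 100 /tmp/ray/session_latest/logs/monitor.log  # autoscaler activity (spawn_updater, etc.)
docker exec -it ray_node ray status                                            # nodes the autoscaler currently sees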