1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.47.1
- Python version: 3.12.7
- OS: Ubuntu 24.04
3. What happened vs. what you expected:

- Expected: When starting the cluster with the config below, I expect Ray to SSH into each host (head and workers), start the Docker container, and run the setup commands defined there. I have also tried adding `min_workers: 2` and `max_workers: 2` (see the sketch below).
- Actual: Only the head node is initialized and set up. The worker nodes listed in `worker_ips` are never contacted or started, and the terminal hangs after the head setup.
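For reference, a minimal sketch of the `min_workers` / `max_workers` variant I tried; both are top-level keys in the Ray cluster YAML, and the rest of the file matches the config below:

```yaml
# Variant also tried: pin the worker count explicitly (top-level keys).
cluster_name: rag-cluster

min_workers: 2
max_workers: 2

provider:
  type: local
  head_ip: 172.16.20.3
  worker_ips: [172.16.20.1, 172.16.20.2]

# docker, auth, and setup commands unchanged from config.yaml below
```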
`config.yaml`:

```yaml
cluster_name: rag-cluster

provider:
  type: local
  head_ip: 172.16.20.3
  worker_ips: [172.16.20.1, 172.16.20.2]  # Mandatory, but does not automatically start the worker nodes for a local cluster

docker:
  image: .../myimage
  pull_before_run: true
  container_name: ray_node
  run_options:
    - --gpus all
    - -v /ray_mount/model_weights:/app/model_weights
    - -v /ray_mount/data:/app/data
    - -v /ray_mount/db:/app/db
    - -v /ray_mount/.hydra_config:/app/.hydra_config
    - -v /ray_mount/logs:/app/logs
    - --env-file /ray_mount/.env

auth:
  ssh_user: root
  ssh_private_key: lucie_chat_id_rsa

head_setup_commands:
  - bash /app/ray-cluster/start_head.sh

worker_setup_commands:
  - bash /app/ray-cluster/start_worker.sh
```
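For completeness, the launch command (inferred from the log below, which shows the `--yes` auto-confirm):

```bash
# Launch command; --yes auto-confirms prompts, as reflected in the log
# ("[automatic, due to --yes]").
ray up config.yaml --yes
```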
Logs Summary:

- `ClusterState` correctly shows: `['172.16.20.3', '172.16.20.2']`
- The head node is launched, the Docker image is pulled, and the setup commands run successfully
- The worker nodes are not triggered at all
- The terminal ends with Ray's startup instructions for the head, with no indication that workers are being set up
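As a point of reference, a sketch of the manual fallback implied by the head node's "Next steps" output (assumptions: the `ray_node` container is already running on each worker and each host is reachable over SSH with the same key):

```bash
# Manual fallback sketch, NOT what ray up is expected to do automatically.
# Assumes the ray_node container is already up on each worker host.
for ip in 172.16.20.1 172.16.20.2; do
  ssh -i lucie_chat_id_rsa root@"$ip" \
    "docker exec ray_node ray start --address='172.16.20.3:6379'"
done
```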
Full logs:
```
Cluster: rag-cluster

Checking Local environment settings
2025-07-22 17:40:29,616 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['172.16.20.3', '172.16.20.2']
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Usage stats collection is enabled. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster.
Acquiring an up-to-date head node
2025-07-22 17:40:29,616 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['172.16.20.3', '172.16.20.2']
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
2025-07-22 17:40:29,618 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['172.16.20.3', '172.16.20.2']
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running uptime as a test.
Fetched IP: 172.16.20.3
Warning: Permanently added '172.16.20.3' (ED25519) to the list of known hosts.
17:40:30 up 9:35, 2 users, load average: 0.20, 0.16, 0.11
Shared connection to 172.16.20.3 closed.
Success.
Updating cluster configuration. [hash=f8946df353933904aff5650b7fb3522bb7c132d7]
2025-07-22 17:40:30,141 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['172.16.20.3', '172.16.20.2']
New status: syncing-files
[2/7] Processing file mounts
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
[3/7] No worker file mounts to sync
2025-07-22 17:40:30,436 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['172.16.20.3', '172.16.20.2']
New status: setting-up
[4/7] No initialization commands to run.
[5/7] Initializing command runner
Shared connection to 172.16.20.3 closed.
Using default tag: latest
latest: Pulling from linagora/openrag-ray
Digest: sha256:cfbc0c67a16a6cd706afd011f7107b545b274da63050054cbcf403300658805c
Status: Image is up to date for .../myimage
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
Tue Jul 22 17:40:31 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:01:00.0 Off |                    0 |
| N/A   40C    P8             16W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Shared connection to 172.16.20.3 closed.
7223a7d6027a0b182cf600da01cf7292abceee8d36daf9e92a39460f68a4c698
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
sending incremental file list
ray_bootstrap_config.yaml

sent 780 bytes  received 35 bytes  1,630.00 bytes/sec
total size is 1,348  speedup is 1.65
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
sending incremental file list
ray_bootstrap_key.pem

sent 2,121 bytes  received 35 bytes  4,312.00 bytes/sec
total size is 2,622  speedup is 1.22
Shared connection to 172.16.20.3 closed.
Shared connection to 172.16.20.3 closed.
[6/7] Running setup commands
(0/1) bash /app/ray-cluster/start_he...
Enable usage stats collection? This prompt will auto-proceed in 10 seconds to avoid blocking cluster startup. Confirm [Y/n]: Y
Usage stats collection is enabled. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster.
Local node IP: 172.16.20.3
Ray runtime started.

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.16.20.3:6379'

  To connect to this Ray cluster:
    import ray
    ray.init(_node_ip_address='172.16.20.3')

  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://172.16.20.3:8265' ray job submit --working-dir . -- python my_script.py
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at
    172.16.20.3:8265

  If connection to the dashboard fails, check your firewall settings and network configuration.
```
- The terminal remains idle here after the head setup; no further activity or errors are shown.