Worker node cannot be added

Allie_Yang · December 13, 2022, 6:26am

How severely does this issue affect your experience of using Ray?

Medium: It makes my task significantly harder to complete, but I can work around it.

I currently have 1 CPU head node and 1 8-GPU worker node working fine. I tried to add another worker node to the cluster by running the command below on the 2nd worker node, which is the same command I ran on the 1st worker node:

ray start --address='10.216.142.235:6379'

The output of ray status looks fine: it shows a total of 3 nodes and the correct total GPUs and CPUs.
However, the dashboard doesn’t show the added node’s resources, and the new node’s log looks odd and cannot be accessed either (see attached screenshot).

I tried to submit a Ray job that needs 10 GPUs (each worker node has 8 GPUs). It fails, which suggests the new node was not added.

import ray

@ray.remote(num_gpus=10, max_retries=0, max_calls=1)
def test_gpu():
    # Print the GPU IDs Ray assigned to this task.
    print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))

Is there any clue why this happens? Again, the same CLI command works fine on the 1st worker node (all nodes are Ubuntu machines). I forget which pip packages or other libraries we installed on the 1st worker node, in case there is any library dependency…
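One way to separate "the node did not join" from "the task cannot be scheduled" is to look at per-node resources rather than the cluster total. As far as I know, a single Ray task is placed on one node, so a request for num_gpus=10 would not fit on 8-GPU nodes no matter how many of them joined, and that test failing does not by itself show the new node is missing. Below is a minimal sketch that summarizes a list shaped like the output of ray.nodes(); the helper names summarize_nodes and fits_on_one_node are hypothetical, and it only relies on the "Alive" and "Resources" keys.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, float]:
    """Summarize GPU capacity from a list shaped like the output of ray.nodes().

    Assumption: each entry has an "Alive" flag and a "Resources" dict whose
    "GPU" key is absent on nodes without GPUs.
    """
    alive = [n for n in nodes if n.get("Alive")]
    per_node = [float(n.get("Resources", {}).get("GPU", 0)) for n in alive]
    return {
        "alive_nodes": len(alive),
        "total_gpus": sum(per_node),
        "max_gpus_on_one_node": max(per_node, default=0.0),
    }


def fits_on_one_node(nodes: List[Dict[str, Any]], num_gpus: float) -> bool:
    """Return True if at least one alive node has num_gpus GPUs in total."""
    return summarize_nodes(nodes)["max_gpus_on_one_node"] >= num_gpus
```

On a machine that is part of the cluster, calling ray.init(address="auto") and then summarize_nodes(ray.nodes()) should report 3 alive nodes and 16 GPUs if the second worker really joined. A test that asks for 8 GPUs per task, submitted twice, would exercise both workers.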

sangcho · December 13, 2022, 7:06am

It is highly likely that your ports are not properly configured. Have you made sure all the ports in this doc are open? Configuring Ray — Ray 3.0.0.dev0

Allie_Yang · December 13, 2022, 9:02pm

I have 1 CPU head node and 1 8-GPU node working already, so this means my ports should be fine?
Is there any additional port requirement for the new worker node?

sangcho · December 13, 2022, 11:33pm

Can you make sure all the ports listed here are open? Configuring Ray — Ray 3.0.0.dev0. Based on the symptoms, it is highly likely that the ports of some components are not open.
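To act on the port suggestion, a quick TCP reachability check can be run between the machines. This is a minimal sketch using only the Python standard library; is_port_open and check_ports are hypothetical helper names, and the list of ports to test should come from the Configuring Ray doc and your own ray start flags (only 6379 appears in this thread).

```python
import socket
from typing import Dict, Iterable


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Map each port to whether it accepted a TCP connection from this machine."""
    return {port: is_port_open(host, port, timeout) for port in ports}
```

For example, check_ports("10.216.142.235", [6379]) run on the new worker tests the path to the head node. A successful connection only shows reachability in that one direction, so it is worth running the same check from the head node and from the first worker against the new worker's IP, since traffic has to flow both ways.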
