Worker node cannot be added

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I currently have 1 CPU head node and one 8-GPU worker node working fine. I tried to add another worker node to the cluster by running the command below on the 2nd worker node, which is the same command I ran on the 1st worker node:

ray start --address=''

The output of ray status looks fine: it shows a total of 3 nodes and the correct total GPUs and CPUs.
However, the dashboard doesn’t show the added node’s resources, and the log of the new node looks wrong and cannot be accessed either (see attached screenshot).

I tried to submit a Ray job that needs 10 GPUs (each worker node has 8 GPUs). It fails, so I take this to mean the new node was not added.

import ray

@ray.remote(num_gpus=10, max_retries=0, max_calls=1)
def test_gpu():
    # Print the GPU IDs that Ray assigned to this task.
    print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))

Is there any clue why this happens? Again, the same CLI command works fine on the 1st worker node (all nodes are Ubuntu machines). I forget which pip packages or other libraries we installed on the 1st worker node, in case there is a library dependency…

[Screenshot: Screen Shot 2022-12-12 at 10.19.49 PM]
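A side note on the GPU test in the question: Ray places each task on a single node, so a task declared with num_gpus=10 cannot be scheduled when every worker has 8 GPUs, whether or not the new node joined. That failure therefore does not tell us much about the second worker. A rough sketch of a check that does exercise both workers is below. The helper names (max_gpus_on_one_node, fits_on_one_node, run_cluster_check) are made up for illustration, and the sketch assumes ray.nodes() returns dicts with "Alive" and "Resources" keys and that it is run on a machine already attached to the cluster.

```python
import socket
import time


def max_gpus_on_one_node(nodes):
    """Largest GPU count on any single alive node.

    `nodes` is a list of dicts shaped like the output of ray.nodes(),
    i.e. each entry has an "Alive" flag and a "Resources" mapping.
    """
    return max(
        (n.get("Resources", {}).get("GPU", 0) for n in nodes if n.get("Alive")),
        default=0,
    )


def fits_on_one_node(nodes, num_gpus):
    """True if one alive node alone can satisfy a num_gpus request."""
    return num_gpus <= max_gpus_on_one_node(nodes)


def run_cluster_check(gpus_per_task=8, hold_seconds=10):
    """Hypothetical helper: launch two tasks that each need a full 8-GPU node.

    If both workers are usable, the tasks should run at the same time on
    different hosts. With only one usable worker they run one after the
    other on the same host.
    """
    import ray  # imported here so the helpers above work without Ray installed

    ray.init(address="auto")  # attach to the already running cluster

    print("10 GPUs fit on one node:", fits_on_one_node(ray.nodes(), 10))

    @ray.remote(num_gpus=gpus_per_task)
    def where_am_i():
        time.sleep(hold_seconds)  # hold the GPUs so the two tasks overlap
        return socket.gethostname(), ray.get_gpu_ids()

    results = ray.get([where_am_i.remote() for _ in range(2)])
    for host, gpu_ids in results:
        print(host, gpu_ids)
    return results
```

Calling run_cluster_check() from the head node and seeing two different hostnames would suggest that both workers accept tasks. Seeing the same hostname twice would point back at the new node.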

It is highly likely that your ports are not properly configured. Have you made sure that all the ports listed in this doc are open: Configuring Ray (Ray 3.0.0.dev0)?

I already have 1 CPU head node and one 8-GPU node working, so doesn't this mean my ports are fine?
Is there any additional port requirement for the new worker node?

Can you make sure all the ports listed here are open? Configuring Ray (Ray 3.0.0.dev0). Based on the symptoms, it is highly likely that the ports of some components are not open.
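One way to act on this is to probe the new worker's ports from the head node (and the head node's ports from the new worker). Below is a minimal sketch using only the Python standard library. The IP address and port numbers are placeholders, not values from this thread. It assumes you know which ports Ray is using on that node, for example because you pinned them with ray start flags such as --node-manager-port and --object-manager-port. Otherwise several of them are chosen at random.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Return {port: bool} saying whether a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


if __name__ == "__main__":
    # Placeholder values: replace with the new worker's IP and the ports
    # you configured for it. These numbers are NOT taken from the thread.
    worker_ip = "192.0.2.10"
    ports_to_check = [8076, 8077]

    for port, is_open in check_ports(worker_ip, ports_to_check).items():
        print(f"{worker_ip}:{port} {'open' if is_open else 'closed or filtered'}")
```

A port reported as closed or filtered only means a TCP connection could not be made from where the script ran. It could be a firewall rule, or simply that nothing is listening on that port, so it helps to run the check in both directions and compare against the working 1st worker node.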