How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I currently have 1 CPU head node and 1 8-GPU worker node working fine. I tried to add another worker node to the cluster by running the command below on the 2nd worker node, which is the same command I ran on the 1st worker node:
ray start --address='10.216.142.235:6379'
The output of ray status looks fine: it shows a total of 3 nodes and the correct totals for GPUs and CPUs.
However, the dashboard doesn't show the added node's resources, and the new node's log looks weird and cannot be accessed either (see attached screenshot).
I tried to submit a Ray job that needs 10 GPUs (each worker node has 8 GPUs), and it fails, so it looks like the new node is not actually usable:
import ray

@ray.remote(num_gpus=10, max_retries=0, max_calls=1)
def test_gpu():
    print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
Is there any clue why this happens? Again, the same CLI works fine on the 1st worker node (all nodes are Ubuntu machines). I forget what pip packages or other libraries we installed on the 1st worker node…in case there is any library dependency…