How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
- Low: It annoys or frustrates me for a moment.
- Medium: It makes completing my task significantly more difficult, but I can work around it.
- High: It blocks me from completing my task.
I am spinning up clusters manually, and almost everything works fine. But every single time, there is one worker node that doesn't execute any tasks (I've tried with anywhere from 3 to 20 nodes). I've SSH'ed into the worker node to check ray status, and everything seems OK. The only difference I see between this worker node and the others is the Plasma value in the dashboard, which is 'N/A'. Other than that, I do not see any difference between the nodes. Any help resolving this would be greatly appreciated. Thanks.
Can you share more details about how you’re starting the node? If you’re using the autoscaler/cluster launcher can you share your config?
I am starting the nodes manually and not using the autoscaler.
ray start --head --node-ip-address="x.x.x.x" --port=6379 --dashboard-host=x.x.x.x --dashboard-port=443
and for worker nodes
ray start --address="$head_node_ip:6379" --node-ip-address="y.y.y.y"
Just curious: what happens if you don't pass the --node-ip-address flag? Why is it necessary in your case?
The issue still persists without the flag. We use --node-ip-address because we start the Ray process inside a container on each node.
Are you running ray start inside the container on the worker node?
I’d love to take a look into this – would you mind opening a bug report with reproduction details?
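For the reproduction details, a small driver script along these lines might be enough to show which node never receives work. This is only a sketch under a few assumptions: it is run on a machine that can attach to the cluster with `ray.init(address="auto")`, the Ray version is recent enough to provide `ray.get_runtime_context().get_node_id()`, and the helper names `find_idle_nodes` and `run_probe` are made up for illustration. It fans out short sleeping tasks, records the node ID each one ran on, and compares that against the alive nodes reported by `ray.nodes()`.

```python
from collections import Counter


def find_idle_nodes(alive_nodes, task_node_ids):
    """Return the sorted IPs of alive nodes that ran no tasks.

    alive_nodes: dict mapping node ID -> node IP address.
    task_node_ids: iterable of node IDs, one per completed task.
    """
    counts = Counter(task_node_ids)
    return sorted(ip for node_id, ip in alive_nodes.items() if counts[node_id] == 0)


def run_probe(num_tasks=200, sleep_s=0.5):
    """Attach to the running cluster, fan out tasks, and report idle nodes.

    Hypothetical helper: num_tasks and sleep_s are arbitrary defaults and may
    need to be raised so that tasks spill over to every node.
    """
    import time

    import ray  # imported lazily so find_idle_nodes works without Ray installed

    ray.init(address="auto")  # attach to the cluster started with `ray start`

    @ray.remote(num_cpus=1)
    def where_am_i():
        time.sleep(sleep_s)  # hold the CPU slot so later tasks go elsewhere
        # Assumes a Ray release that provides get_node_id() on the runtime context.
        return ray.get_runtime_context().get_node_id()

    alive_nodes = {
        n["NodeID"]: n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]
    }
    task_node_ids = ray.get([where_am_i.remote() for _ in range(num_tasks)])
    return find_idle_nodes(alive_nodes, task_node_ids)
```

Calling `run_probe()` from a driver on the head node and pasting the returned list, together with the `ray status` output from the affected worker, would give a concrete starting point for the bug report. If the list is empty, the total number of CPU slots may simply exceed `num_tasks`, so it is worth raising that value before drawing conclusions.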
@Dmitri Can we resolve this issue and continue further discussion on the GitHub bug report?
Carrying over to GitHub as recommended.