How severe does this issue affect your experience of using Ray? Medium: It contributes to significant difficulty to complete my task, but I can work around it. I have been using ray in a cloud-based cluster for some batch tasks (run thousands of homogeneous tasks, call the same remote function ma…

Some related threads: Ray tasks sometimes hang in PENDING_NODE_ASSIGNMENT (Ray Core): "Note that they all have the same IP for some reason. cc @sangcho this is probably yet another public/private IP problem" Here all the workers have different IPs. Network reachab…

Hi @DannyChen , thanks for the great report. cc @jjyao , what kind of information would be helpful in getting to the bottom of this issue?

@DannyChen based on what you pasted, there are 160 run_on_dockers that are running but ray status shows that no resources are used. Are these tasks using 2 cpu and 10 memory each?

Hi @jjyao, thanks for the quick reply! Yes, every task requests 2 CPU + 10 Memory. Those “running” tasks are “zombies” now (their workers had already dropped to a 0% CPU idle state). I have already killed those workers, but the 160 “running” tasks never moved to “failed”.

The number 160 comes from the previous cluster size (320 cores). Out of curiosity, I added 400 CPUs to the cluster, beyond 2x160 required by the “running” tasks. The scheduler is still stuck :frowning: However, I noticed in the dashboard that workers have “0kb/0kb” object store memory. Is this abno…
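The arithmetic behind these numbers can be checked with a short sketch. It assumes CPU is the only binding resource; the custom "Memory" resource is left out because the per-node amount is not fully shown in the thread. The function name `max_concurrent_tasks` is hypothetical.

```python
def max_concurrent_tasks(total_cpus: int, cpus_per_task: int) -> int:
    """Upper bound on simultaneously running tasks if CPU were the only constraint."""
    if cpus_per_task <= 0:
        raise ValueError("cpus_per_task must be positive")
    return total_cpus // cpus_per_task


CPUS_PER_TASK = 2            # from the thread: each task requests 2 CPUs
PREVIOUS_CLUSTER_CPUS = 320  # from the thread: previous cluster size
ADDED_CPUS = 400             # from the thread: capacity added afterwards

# Tasks that could run on the old cluster, and on the newly added capacity alone.
stuck_running = max_concurrent_tasks(PREVIOUS_CLUSTER_CPUS, CPUS_PER_TASK)
extra_slots = max_concurrent_tasks(ADDED_CPUS, CPUS_PER_TASK)
print(stuck_running, extra_slots)
```

Under this CPU-only assumption, the added capacity alone has room for more tasks than the 160 stale "running" entries, so a plain CPU shortage would not explain the stall.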

DannyChen: "However, I noticed in the dashboard that workers have “0kb/0kb” object store memory. Is this abnormal?"

This definitely looks wrong. How did you start the worker node? Are you using ray start? Could you also show the output of ray list nodes? Also I’m happy to do a quick v…

The worker nodes are freshly spun-up VMs that use ray start to connect to the head node: ray start --disable-usage-stats --address="10.10.10.10:6379" --redis-password="****" --resources='{"Memory":"?00"}' --object-store-memory 1048576000 The output of ray list nodes is as follows: ======…

It seems I have run into this problem too. May I ask how it was eventually solved?

Based on our discussion, one possible reason is that the caller script (the one doing ray.get([...]) with thousands of tasks) was running on a remote client, which, when disconnected, causes Ray’s retry mechanism to malfunction when a worker is killed. Changing the script to make this ray.get happen on …
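A minimal sketch of that idea follows. Since the post above is cut off, it assumes the driver is moved onto a machine inside the cluster (for example the head node) and attaches to the running cluster with `ray.init(address="auto")`. The body of `run_on_docker` is a hypothetical stand-in, and the resource request mirrors the "2 CPU + 10 Memory" figure from the thread. Ray is imported lazily so the module loads even where Ray is not installed.

```python
def run_on_docker(task_id):
    # Hypothetical stand-in for the real task body discussed in the thread.
    return task_id


def main(num_tasks=1000):
    import ray  # lazy import: only needed when actually driving the cluster

    # Attach to the already-running cluster from a node inside it,
    # so ray.get does not depend on a remote client's connection.
    ray.init(address="auto")

    # Each task requests 2 CPUs and 10 units of the custom "Memory" resource.
    remote_fn = ray.remote(
        num_cpus=2,
        resources={"Memory": 10},
        max_retries=3,  # let Ray retry tasks whose worker process dies
    )(run_on_docker)

    refs = [remote_fn.remote(i) for i in range(num_tasks)]
    return ray.get(refs)  # blocks on the cluster-side driver


# On a cluster node, call main() to submit the batch and collect the results.
```

Whether this resolves the hang in a given setup is not confirmed by the thread; it only illustrates running the blocking `ray.get` next to the cluster rather than from a remote client.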

Subset of tasks stuck in "PENDING_NODE_ASSIGNMENT" forever

Ray Clusters

DannyChen March 20, 2023, 5:38pm 2

Some related threads:

Here all the workers have different IPs. Network reachability is not an issue, as a smaller-scale cluster worked fine.

The tasks are stuck for hours (both the dashboard and the CLI show that the workers are idle and that new workers have been added), so the issue is likely not related to synchronization.

ray.nodes() call correctly reports dead workers and alive (idle) workers, same as in dashboard.
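A check like this can be scripted. The sketch below summarizes a list shaped like the output of `ray.nodes()`, assuming each entry carries the `Alive`, `NodeManagerAddress`, and `Resources` keys (field names may differ between Ray versions). It also flags alive nodes that report no `object_store_memory`, which relates to the "0kb/0kb" observation in the dashboard. The helper name `summarize_nodes` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_nodes(nodes):
    """Count alive/dead nodes and list alive nodes reporting no object store memory."""
    counts = Counter("alive" if n.get("Alive") else "dead" for n in nodes)
    no_object_store = [
        n.get("NodeManagerAddress", "?")
        for n in nodes
        if n.get("Alive") and not n.get("Resources", {}).get("object_store_memory")
    ]
    return {
        "alive": counts["alive"],
        "dead": counts["dead"],
        "no_object_store": no_object_store,
    }


# Made-up sample shaped like ray.nodes() entries; not taken from the thread.
sample = [
    {"NodeManagerAddress": "10.0.0.1", "Alive": True,
     "Resources": {"CPU": 8.0, "object_store_memory": 1048576000.0}},
    {"NodeManagerAddress": "10.0.0.2", "Alive": True, "Resources": {"CPU": 8.0}},
    {"NodeManagerAddress": "10.0.0.3", "Alive": False, "Resources": {}},
]
summary = summarize_nodes(sample)
print(summary)
```

On a live cluster, the same helper could be called as `summarize_nodes(ray.nodes())` after connecting with `ray.init(address="auto")`.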

Related topics

Topic	Category	Replies	Views	Activity
Join tasks getting stuck in PENDING_NODE_ASSIGNMENT	Ray Data	7	411	May 21, 2025
Pending tasks not starting up	Kubernetes	7	1723	May 13, 2022
Local Ray cluster won't send any tasks to worker node	Ray Clusters	11	1110	August 6, 2024
Remote Worker Nodes die after a few seconds	Ray Clusters	5	2220	July 17, 2024
Ray job is stuck when node worker runs on is killed	Ray Core	3	1898	July 1, 2022