Subset of tasks stuck in "PENDING_NODE_ASSIGNMENT" forever

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have been using Ray in a cloud-based cluster for batch jobs: thousands of homogeneous tasks, calling the same remote function many times with different inputs. Everything worked great until I recently started scaling the cluster up further (currently about 300 cores across 20 workers).

Occasionally, some workers stop running new tasks and their CPU utilization goes to 0%. In the worst case, the scheduler stops entirely and all workers become idle, with all remaining tasks stuck in the PENDING_NODE_ASSIGNMENT state.

The issue persists even if I manually kill all worker servers and add fresh workers to the cluster. The only way to restore a healthy state is to restart the head node, which causes all pending tasks to be lost.

The cluster currently runs ray 2.3.0. I have checked head node logs (debug_state.txt, debug_state_gcs.txt, gcs_server.err, monitor.err, etc.) and they contain no apparent errors. (Previously, gcs_server.err reported “too many open files”; this has been addressed.)

Any pointers on how to debug this further would be greatly appreciated! Thanks in advance.

===
Screenshot: job dashboard showing tasks stuck in pending.

Below is the diagnostic output from ray status, ray summary tasks, and ray summary objects.

======== Autoscaler status: 2023-03-20 16:17:02.803611 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_37e23132b9b4e058bc249fd351e386c1db1a1bebdd95791f9042166d
 1 node_bf1e7380dc822cfb3be95c9b60deafeb19571b1a672c7f93b544a05c
 1 node_6f8998a6993ee6de627e147f610d5194ae67175e75fd8fe8a87172e4
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/164.0 CPU
 0.0/800.0 Memory
 0.00/788.691 GiB memory
 0.00/4.727 GiB object_store_memory

Demands:
 {'CPU': 2.0, 'Memory': 10.0}: 2234+ pending tasks/actors

======== Tasks Summary: 2023-03-20 16:17:07.231791 ========
Stats:
------------------------------------
total_actor_scheduled: 0
total_actor_tasks: 0
total_tasks: 4748


Table (group by func_name):
------------------------------------
    FUNC_OR_CLASS_NAME    STATE_COUNTS                   TYPE
0   run_on_docker         FAILED: 725                    NORMAL_TASK
                          FINISHED: 1612
                          PENDING_NODE_ASSIGNMENT: 2244
                          RUNNING: 160
1   check_summary         FINISHED: 5                    NORMAL_TASK
                          RUNNING: 1
2   check_uptime          PENDING_NODE_ASSIGNMENT: 1     NORMAL_TASK


======== Object Summary: 2023-03-20 16:17:08.433759 ========
Stats:
------------------------------------
callsite_enabled: false
total_objects: 2987
total_size_mb: 950.1486234664917


Table (group by callsite)
------------------------------------
disabled
    REF_TYPE_COUNTS        TASK_STATE_COUNTS                            TOTAL_NUM_NODES    TOTAL_NUM_WORKERS    TOTAL_OBJECTS    TOTAL_SIZE_MB
--  ---------------------  -------------------------------------------  -----------------  -------------------  ---------------  ---------------
0   LOCAL_REFERENCE: 2987  'Attempt #2: PENDING_NODE_ASSIGNMENT': 1764  1                  2                    2987             950.149
                           FINISHED: 581
                           PENDING_NODE_ASSIGNMENT: 481
                           SUBMITTED_TO_WORKER: 161

Some related threads:

Here, all the workers have different IPs. Network reachability is not an issue, as the smaller-scale cluster worked fine.

The tasks have been stuck for hours (both the dashboard and the CLI show that the workers are idle and that new workers have been added), so the issue is likely not related to synchronization.

The ray.nodes() call correctly reports dead workers and alive (idle) workers, matching the dashboard.
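For anyone wanting to repeat the ray.nodes() check, here is a minimal sketch of how the result could be summarized. It assumes each entry is a dict with "Alive" and "Resources" keys; the helper names `summarize_nodes` and `summarize_live_cluster` are made up for illustration, and the sample data is hand-written to mirror two entries from the node list in this thread.

```python
from collections import Counter


def summarize_nodes(nodes):
    """Count alive/dead nodes and sum resources over the alive ones.

    `nodes` is a list of dicts shaped like the return value of ray.nodes();
    only the "Alive" and "Resources" keys are read.
    """
    states = Counter("ALIVE" if n.get("Alive") else "DEAD" for n in nodes)
    totals = Counter()
    for n in nodes:
        if not n.get("Alive"):
            continue
        for name, amount in n.get("Resources", {}).items():
            # Skip the per-node marker resource, e.g. "node:10.10.44.102".
            if not name.startswith("node:"):
                totals[name] += amount
    return dict(states), dict(totals)


def summarize_live_cluster():
    """Same summary against a running cluster (not called in this sketch)."""
    # Imported lazily so the helper above works without Ray installed.
    import ray

    # Assumes this runs on a machine that is part of the cluster.
    ray.init(address="auto", ignore_reinit_error=True)
    return summarize_nodes(ray.nodes())


# Hand-written sample mirroring two entries from the node list in this thread.
sample = [
    {"NodeManagerAddress": "10.10.44.102", "Alive": True,
     "Resources": {"CPU": 80.0, "Memory": 400.0, "node:10.10.44.102": 1.0}},
    {"NodeManagerAddress": "10.10.78.234", "Alive": False,
     "Resources": {"CPU": 80.0, "Memory": 400.0, "node:10.10.78.234": 1.0}},
]
states, totals = summarize_nodes(sample)
print(states, totals)
```

If the alive count and resource totals here agree with ray status while tasks still stay in PENDING_NODE_ASSIGNMENT, that suggests the node view itself is consistent and the problem lies elsewhere.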

Hi @DannyChen, thanks for the great report. cc @jjyao, what kind of information would be helpful in getting to the bottom of this issue?

@DannyChen based on what you pasted, there are 160 run_on_docker tasks in the RUNNING state, but ray status shows that no resources are in use. Are these tasks using 2 CPU and 10 Memory each?

Hi @jjyao, thanks for the quick reply! Yes, all tasks cost 2 CPU + 10 Memory. Those “running” tasks are “zombies” now (the workers have already entered the 0% CPU idle state).

I have already killed those workers, but the 160 “running” tasks didn’t go into “failed”.

The number 160 comes from the previous cluster size (320 cores). Out of curiosity, I added 400 CPUs to the cluster, beyond the 2x160 required by the “running” tasks. The scheduler is still stuck.

However, I noticed in the dashboard that workers have “0kb/0kb” object store memory. Is this abnormal?
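The capacity reasoning in this reply can be checked with a little arithmetic. The sketch below assumes the per-node figures from the ray list nodes output in this thread (80 CPU and 400 Memory per worker, 4 CPU and no custom Memory resource on the head) and five alive workers, which is inferred from the 404 CPU total; `task_slots` is a hypothetical helper, not part of Ray.

```python
def task_slots(available, demand):
    """Return how many copies of `demand` fit into `available` on one node."""
    return min(int(available.get(name, 0) // amount)
               for name, amount in demand.items())


# Per-task demand as reported by ray status: {'CPU': 2.0, 'Memory': 10.0}.
demand = {"CPU": 2, "Memory": 10}

worker = {"CPU": 80, "Memory": 400}
head = {"CPU": 4}  # the head node has no custom "Memory" resource

slots_per_worker = task_slots(worker, demand)
slots_on_head = task_slots(head, demand)

alive_workers = 5  # assumption: 5 * 80 + 4 = 404 CPU, as in ray status
total_slots = alive_workers * slots_per_worker + slots_on_head
print(slots_per_worker, slots_on_head, total_slots)
```

Under these assumptions there is room for 200 concurrent tasks, none of them on the head node, so the 2234+ pending demands are not blocked by a lack of resources.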

`ray status` output
======== Autoscaler status: 2023-03-20 19:23:06.702801 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_5fb36e7769df5394cfe9be2c84cd61547dd4e53529957cbcc7e7375b
 1 node_535ecf989c96179367a664d18c40239c0e11cc27e1e3d6a443a178b6
 1 node_7dbd9b7c5d1e38cae04e55762d505be991751ff9896b84f8cde28106
 1 node_6f8998a6993ee6de627e147f610d5194ae67175e75fd8fe8a87172e4
 1 node_3bfb1ffa2fa9b4d6a8eefab6e15c1138dbc4defeb309958180740698
 1 node_3dc0942946af1c1ab9305bb6a73d843f69161ff0c90aad0698962d9b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/404.0 CPU
 0.0/2000.0 Memory
 0.00/1963.531 GiB memory
 0.00/7.657 GiB object_store_memory

Demands:
 {'CPU': 2.0, 'Memory': 10.0}: 2234+ pending tasks/actors

======== Tasks Summary: 2023-03-20 19:23:11.693436 ========
Stats:
------------------------------------
total_actor_scheduled: 0
total_actor_tasks: 0
total_tasks: 4752


Table (group by func_name):
------------------------------------
    FUNC_OR_CLASS_NAME    STATE_COUNTS                   TYPE
0   run_on_docker         FAILED: 725                    NORMAL_TASK
                          FINISHED: 1612
                          PENDING_NODE_ASSIGNMENT: 2244
                          RUNNING: 160
1   check_summary         FINISHED: 9                    NORMAL_TASK
                          RUNNING: 1
2   check_uptime          PENDING_NODE_ASSIGNMENT: 1     NORMAL_TASK

This definitely looks wrong. How did you start the worker node? Are you using ray start?

Could you also show the output of ray list nodes?

Also I’m happy to do a quick video chat to make the debugging faster.

The worker nodes are freshly spun-up VMs that use ray start to connect to the head node:

ray start  --disable-usage-stats --address="10.10.10.10:6379" --redis-password="****"  --resources='{"Memory":"?00"}'   --object-store-memory 1048576000 

The output of ray list nodes is as follows:


======== List: 2023-03-20 20:04:11.093174 ========
Stats:
------------------------------
Total: 36

Table:
------------------------------
    NODE_ID                                                   NODE_IP        STATE    NODE_NAME      RESOURCES_TOTAL
 0  194de67549f11701e30f604b5f61991d56f4ce8213f092826b599baa  10.10.44.102   ALIVE    10.10.44.102   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420437946368.0
                                                                                                     node:10.10.44.102: 1.0
                                                                                                     object_store_memory: 1048576000.0
 1  1ba20a961e1ee756177e8d79b3f9903e5a49930d5857f840215cdef7  10.10.78.234   DEAD     10.10.78.234   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420430684160.0
                                                                                                     node:10.10.78.234: 1.0
                                                                                                     object_store_memory: 1048576000.0
 2  2e404ffda35765d9e8636d9fd113f7d806c9571db29c46b173aa1a3b  10.10.129.115  DEAD     10.10.129.115  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420475068416.0
                                                                                                     node:10.10.129.115: 1.0
                                                                                                     object_store_memory: 1048576000.0
 3  2ed83abd5b7496296e623d0e58595144928353e8bf2209ade58ce0fd  10.10.41.214   ALIVE    10.10.41.214   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420476485632.0
                                                                                                     node:10.10.41.214: 1.0
                                                                                                     object_store_memory: 1048576000.0
 4  2edef70fb117ef4bae491e666bafdc0d121367b652414e01de71843b  10.10.91.234   ALIVE    10.10.91.234   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420450050048.0
                                                                                                     node:10.10.91.234: 1.0
                                                                                                     object_store_memory: 1048576000.0
 5  37e23132b9b4e058bc249fd351e386c1db1a1bebdd95791f9042166d  10.10.139.118  DEAD     10.10.139.118  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420495740928.0
                                                                                                     node:10.10.139.118: 1.0
                                                                                                     object_store_memory: 1048576000.0
 6  392a3c17125b9a704037f0c196c2fe40869e7d085fd1654fd8f525f0  10.10.68.22    DEAD     10.10.68.22    CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420473323520.0
                                                                                                     node:10.10.68.22: 1.0
                                                                                                     object_store_memory: 1048576000.0
 7  3bfb1ffa2fa9b4d6a8eefab6e15c1138dbc4defeb309958180740698  10.10.83.212   DEAD     10.10.83.212   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420476932096.0
                                                                                                     node:10.10.83.212: 1.0
                                                                                                     object_store_memory: 1048576000.0
 8  3dc0942946af1c1ab9305bb6a73d843f69161ff0c90aad0698962d9b  10.10.90.164   DEAD     10.10.90.164   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420494503936.0
                                                                                                     node:10.10.90.164: 1.0
                                                                                                     object_store_memory: 1048576000.0
 9  40c96f4ec71ceebfee1725980647983716d0c7b61cce25782de1358c  10.10.148.34   DEAD     10.10.148.34   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420462432256.0
                                                                                                     node:10.10.148.34: 1.0
                                                                                                     object_store_memory: 1048576000.0
10  535ecf989c96179367a664d18c40239c0e11cc27e1e3d6a443a178b6  10.10.191.86   DEAD     10.10.191.86   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420436226048.0
                                                                                                     node:10.10.191.86: 1.0
                                                                                                     object_store_memory: 1048576000.0
11  590b271f38178f25f962c0a0c4f7aeeba9e0b2a628cd20345ee1abc2  10.10.179.186  DEAD     10.10.179.186  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420499181568.0
                                                                                                     node:10.10.179.186: 1.0
                                                                                                     object_store_memory: 1048576000.0
12  5b8e173b6e35645a274588bf0d164570d61e2fe81cd037d997e954ba  10.10.179.213  DEAD     10.10.179.213  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420494204928.0
                                                                                                     node:10.10.179.213: 1.0
                                                                                                     object_store_memory: 1048576000.0
13  5fb36e7769df5394cfe9be2c84cd61547dd4e53529957cbcc7e7375b  10.10.75.22    DEAD     10.10.75.22    CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420484145152.0
                                                                                                     node:10.10.75.22: 1.0
                                                                                                     object_store_memory: 1048576000.0
14  651d1dfe3c463dd047ac60edd1d4c96c96ae49adf7860afa01ec2110  10.10.204.22   ALIVE    10.10.204.22   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420457480192.0
                                                                                                     node:10.10.204.22: 1.0
                                                                                                     object_store_memory: 1048576000.0
15  6e93afeff52e8c6e1e160fa3f6f723c1279cd62b48fe9dc7876c674e  10.10.55.68    DEAD     10.10.55.68    CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420479401984.0
                                                                                                     node:10.10.55.68: 1.0
                                                                                                     object_store_memory: 1048576000.0
16  6f8998a6993ee6de627e147f610d5194ae67175e75fd8fe8a87172e4  10.10.10.10    ALIVE    10.10.10.10    CPU: 4.0
                                                                                                     memory: 5956526900.0
                                                                                                     node:10.10.10.10: 1.0
                                                                                                     object_store_memory: 2978263449.0
17  7b32547373e1eae789f01a7037f4ca6bef42a38860c23508a8e73c22  10.10.14.11    ALIVE    10.10.14.11    CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420470149120.0
                                                                                                     node:10.10.14.11: 1.0
                                                                                                     object_store_memory: 1048576000.0
18  7dbd9b7c5d1e38cae04e55762d505be991751ff9896b84f8cde28106  10.10.124.217  DEAD     10.10.124.217  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420476768256.0
                                                                                                     node:10.10.124.217: 1.0
                                                                                                     object_store_memory: 1048576000.0
19  86332cc9670e195f09c362018d4941f4f8f0f0821a2cade1eee471d7  10.10.207.11   DEAD     10.10.207.11   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420494401536.0
                                                                                                     node:10.10.207.11: 1.0
                                                                                                     object_store_memory: 1048576000.0
20  8691f62768248964ed41b9668a3cd93237f661fbf622935ec97b6fb5  10.10.255.214  ALIVE    10.10.255.214  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420471754752.0
                                                                                                     node:10.10.255.214: 1.0
                                                                                                     object_store_memory: 1048576000.0
21  95459da3a4f3204badf1dddd240b373c9fa1523195f93f8ba91a0091  10.10.243.8    ALIVE    10.10.243.8    CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420472946688.0
                                                                                                     node:10.10.243.8: 1.0
                                                                                                     object_store_memory: 1048576000.0
22  9d79d5eb00ddf9745b3dbc93b0c72b0dbfe4af56e07a84f6249c622b  10.10.189.12   DEAD     10.10.189.12   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420486901760.0
                                                                                                     node:10.10.189.12: 1.0
                                                                                                     object_store_memory: 1048576000.0
23  a2201b214cc2c9da7ca4178046b7804e77716610c0bfd750a1232d43  10.10.141.198  DEAD     10.10.141.198  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420456001536.0
                                                                                                     node:10.10.141.198: 1.0
                                                                                                     object_store_memory: 1048576000.0
24  ab6ac53d1036bf48629094008d77b9567f94fcf17406762340046e01  10.10.82.247   DEAD     10.10.82.247   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420494905344.0
                                                                                                     node:10.10.82.247: 1.0
                                                                                                     object_store_memory: 1048576000.0
25  abf5c4ec83fa410909232700c6aebf79565de9db5f1a301ea3fee532  10.10.6.246    DEAD     10.10.6.246    CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420464316416.0
                                                                                                     node:10.10.6.246: 1.0
                                                                                                     object_store_memory: 1048576000.0
26  adf9a44c4b30d0da3b3f6ce6205300387139852d1cac38e31ca90b5b  10.10.245.208  DEAD     10.10.245.208  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420472823808.0
                                                                                                     node:10.10.245.208: 1.0
                                                                                                     object_store_memory: 1048576000.0
27  ba06e0f430bffc2f4b1ccf050cc4e1f76a4adebc438a2fbc0b0420f5  10.10.39.95    DEAD     10.10.39.95    CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420504715264.0
                                                                                                     node:10.10.39.95: 1.0
                                                                                                     object_store_memory: 1048576000.0
28  bf1e7380dc822cfb3be95c9b60deafeb19571b1a672c7f93b544a05c  10.10.177.103  DEAD     10.10.177.103  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420397867008.0
                                                                                                     node:10.10.177.103: 1.0
                                                                                                     object_store_memory: 1048576000.0
29  c55bc798e65f113da8dfadc4326a00444b7a035782f618a2a539d4d7  10.10.254.21   DEAD     10.10.254.21   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420462739456.0
                                                                                                     node:10.10.254.21: 1.0
                                                                                                     object_store_memory: 1048576000.0
30  daf2cfba1cbbcf6042b8384f1e8783594e23efb909ac80a6d5db2917  10.10.159.183  DEAD     10.10.159.183  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420472111104.0
                                                                                                     node:10.10.159.183: 1.0
                                                                                                     object_store_memory: 1048576000.0
31  e1aa43ae23d9bd6d54ec63b122814a48bdacfb340c707cd5693360f4  10.10.241.81   ALIVE    10.10.241.81   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420448931840.0
                                                                                                     node:10.10.241.81: 1.0
                                                                                                     object_store_memory: 1048576000.0
32  e8e82b708ea09b5cdf0a57a0e15b7dc78a9fda12c535d9285678ae14  10.10.48.207   ALIVE    10.10.48.207   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420466184192.0
                                                                                                     node:10.10.48.207: 1.0
                                                                                                     object_store_memory: 1048576000.0
33  ea3237fc97a3ce2a1ad3dd4592671cbcf8e404bae2b938cf589bb353  10.10.185.220  DEAD     10.10.185.220  CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420465213440.0
                                                                                                     node:10.10.185.220: 1.0
                                                                                                     object_store_memory: 1048576000.0
34  f049191cd01415da37a663414a282c57d97c1c549c9d2f5372a16dbe  10.10.111.105  DEAD     10.10.111.105  CPU: 2.0
                                                                                                     Memory: 20.0
                                                                                                     memory: 19528146944.0
                                                                                                     node:10.10.111.105: 1.0
                                                                                                     object_store_memory: 1048576000.0
35  f69bc8947e4946454297294f90c9f5b5e439324333065729cdd4c52c  10.10.69.250   DEAD     10.10.69.250   CPU: 80.0
                                                                                                     Memory: 400.0
                                                                                                     memory: 420463702016.0
                                                                                                     node:10.10.69.250: 1.0
                                                                                                     object_store_memory: 1048576000.0

Happy to chat! I’ll reach out to you on Ray slack channel.

It seems I have run into this problem too. May I ask how it was eventually solved?

Based on our discussion, one possible cause is that the caller script (the one calling ray.get([...]) on thousands of tasks) was running on a remote client. When that client disconnects, Ray’s retry mechanism appears to malfunction once workers are killed. Changing the script so that ray.get runs on the head node might solve the issue. (However, I haven’t tested this in our production environment yet.)
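As a rough illustration of that suggestion, here is a sketch of a driver meant to be run directly on the head node rather than from a remote client. The body of run_on_docker is a placeholder, `run_batch` and `TASK_OPTIONS` are hypothetical names, and the resource values are the 2 CPU + 10 Memory mentioned above; this is untested against the original setup.

```python
# Per-task options; "Memory" is the custom resource set via `ray start --resources`.
TASK_OPTIONS = {"num_cpus": 2, "resources": {"Memory": 10}, "max_retries": 3}


def run_batch(inputs):
    """Submit one task per input and block until all results are back."""
    # Imported lazily so this file can be inspected without Ray installed.
    import ray

    # "auto" attaches to the cluster already running on this machine,
    # so the driver lives on the head node instead of a remote client.
    ray.init(address="auto", ignore_reinit_error=True)

    def run_on_docker(item):
        # Placeholder body; the real task logic is not shown in this thread.
        return item

    remote_fn = ray.remote(**TASK_OPTIONS)(run_on_docker)
    refs = [remote_fn.remote(item) for item in inputs]
    # ray.get now runs in a process on the head node.
    return ray.get(refs)
```

One way to use it would be to copy the script to the head node and call run_batch(...) there, for example inside a tmux or screen session, so that a dropped laptop connection does not take the driver down with it.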