How severely does this issue affect your experience of using Ray?
Medium: It causes significant difficulty in completing my task, but I can work around it.
I have been using Ray on a cloud-based cluster for batch workloads (running thousands of homogeneous tasks by calling the same remote function many times with different inputs). Everything worked great until I recently started scaling the cluster up further (currently about 300 cores across 20 worker nodes).
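For reference, the workload has roughly the following shape (a minimal sketch: the task body and inputs are placeholders, and the per-task resource request matches the `{'CPU': 2.0, 'Memory': 10.0}` demand shown in the `ray status` output below):

```python
import ray

ray.init(address="auto")  # driver connects to the existing cluster

# Each task requests 2 CPUs plus 10 units of the custom "Memory" resource
# that the worker nodes advertise via `ray start --resources`.
@ray.remote(num_cpus=2, resources={"Memory": 10})
def run_on_docker(job_input):
    ...  # placeholder: run one homogeneous batch job
    return job_input

refs = [run_on_docker.remote(i) for i in range(5000)]  # thousands of calls
results = ray.get(refs)
```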
Occasionally, some workers stop running new tasks and their CPU utilization goes to 0%. In the worst case, the scheduler stops entirely and all workers become idle, with all remaining tasks stuck in the PENDING_NODE_ASSIGNMENT state.
The issue persists even if I manually kill all worker nodes and add fresh new workers to the cluster. The only way to restore a healthy state is to restart the head node, which causes all pending tasks to be lost.
The cluster currently runs Ray 2.3.0. I have checked the head node logs (debug_state.txt, debug_state_gcs.txt, gcs_server.err, monitor.err, etc.) and they contain no apparent errors. (Previously, gcs_server.err reported "too many open files"; this has been addressed.)
Any pointers on how to debug this further would be greatly appreciated! Thanks in advance.
===
Screenshot: job dashboard showing tasks stuck in pending.
Below is the diagnostics output from `ray status`, `ray summary tasks`, and `ray summary objects`.
======== Autoscaler status: 2023-03-20 16:17:02.803611 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_37e23132b9b4e058bc249fd351e386c1db1a1bebdd95791f9042166d
1 node_bf1e7380dc822cfb3be95c9b60deafeb19571b1a672c7f93b544a05c
1 node_6f8998a6993ee6de627e147f610d5194ae67175e75fd8fe8a87172e4
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/164.0 CPU
0.0/800.0 Memory
0.00/788.691 GiB memory
0.00/4.727 GiB object_store_memory
Demands:
{'CPU': 2.0, 'Memory': 10.0}: 2234+ pending tasks/actors
======== Tasks Summary: 2023-03-20 16:17:07.231791 ========
Stats:
------------------------------------
total_actor_scheduled: 0
total_actor_tasks: 0
total_tasks: 4748
Table (group by func_name):
------------------------------------
FUNC_OR_CLASS_NAME STATE_COUNTS TYPE
0 run_on_docker FAILED: 725 NORMAL_TASK
FINISHED: 1612
PENDING_NODE_ASSIGNMENT: 2244
RUNNING: 160
1 check_summary FINISHED: 5 NORMAL_TASK
RUNNING: 1
2 check_uptime PENDING_NODE_ASSIGNMENT: 1 NORMAL_TASK
======== Object Summary: 2023-03-20 16:17:08.433759 ========
Stats:
------------------------------------
callsite_enabled: false
total_objects: 2987
total_size_mb: 950.1486234664917
Table (group by callsite)
------------------------------------
disabled
REF_TYPE_COUNTS TASK_STATE_COUNTS TOTAL_NUM_NODES TOTAL_NUM_WORKERS TOTAL_OBJECTS TOTAL_SIZE_MB
-- --------------------- ------------------------------------------- ----------------- ------------------- --------------- ---------------
0 LOCAL_REFERENCE: 2987 'Attempt #2: PENDING_NODE_ASSIGNMENT': 1764 1 2 2987 950.149
FINISHED: 581
PENDING_NODE_ASSIGNMENT: 481
SUBMITTED_TO_WORKER: 161
Some related threads:
Note that they all have the same IP for some reason.
cc @sangcho this is probably yet another public/private IP problem
Here all the workers have different IPs. Network reachability is not an issue, as a smaller-scale cluster worked fine.
Hmm it’s possible the ray.nodes() discrepancy is because the cluster state takes some time to converge.
If you can reproduce the issue right now, you could try checking where your application is getting stuck. This docs page on debugging might be useful to look at. Here are some relevant tools you can try (see the sketch after this list):
- The `ray memory` CLI will tell you which ObjectRefs are currently in scope and which are still pending execution.
- Passing the OS environment variable RAY_record_ref_creation_sites=1 to Ray will …
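For concreteness, here is a minimal sketch of how these two tools could be invoked. This is an assumption based on the general Ray debugging workflow, not something taken verbatim from the thread; per the docs, the environment variable also needs to be set when the cluster nodes run `ray start`.

```python
import os
import subprocess

# Record the creation callsite of every ObjectRef. Must be set before Ray
# starts in this process (and, per the docs, on the cluster nodes as well).
os.environ["RAY_record_ref_creation_sites"] = "1"

import ray  # imported after the env var is set

ray.init(address="auto")  # assumes the driver runs inside the existing cluster

# ... submit the batch tasks as usual ...

# Inspect which ObjectRefs are still in scope; this simply shells out to the
# `ray memory` CLI, which can also be run directly from a terminal.
print(subprocess.run(["ray", "memory"], capture_output=True, text=True).stdout)
```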
The tasks have been stuck for hours (both the dashboard and the CLI show that workers are idle / new workers have been added), so the issue is likely not a synchronization delay.
The `ray.nodes()` call correctly reports dead workers and alive (idle) workers, same as the dashboard.
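For reference, that check used the public `ray.nodes()` API, along these lines (a minimal sketch; the printed fields are just the ones relevant here):

```python
import ray

ray.init(address="auto")

# ray.nodes() returns one dict per node that has ever joined the cluster,
# including nodes the GCS has already marked as dead.
for node in ray.nodes():
    state = "ALIVE" if node["Alive"] else "DEAD"
    print(state, node["NodeManagerAddress"], node["Resources"].get("CPU", 0.0))
```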
cade, March 20, 2023, 6:46pm (#3)
Hi @DannyChen, thanks for the great report. cc @jjyao, what kind of information would be helpful in getting to the bottom of this issue?
jjyao, March 20, 2023, 7:06pm (#4)
@DannyChen based on what you pasted, there are 160 `run_on_docker` tasks that are running, but `ray status` shows that no resources are used. Are these tasks using 2 CPU and 10 Memory each?
Hi @jjyao, thanks for the quick reply! Yes, every task costs 2 CPU + 10 Memory. Those "running" tasks are "zombies" now (their workers have already entered the 0% CPU idle state).
I have already killed those workers, but the 160 "running" tasks didn't transition to "failed".
The number 160 comes from the previous cluster size (320 cores). Out of curiosity, I added 400 CPUs to the cluster, well beyond the 2 × 160 = 320 CPUs required by the "running" tasks. The scheduler is still stuck.
However, I noticed in the dashboard that the workers show "0kb/0kb" object store memory. Is this abnormal?
`ray status` output
======== Autoscaler status: 2023-03-20 19:23:06.702801 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_5fb36e7769df5394cfe9be2c84cd61547dd4e53529957cbcc7e7375b
1 node_535ecf989c96179367a664d18c40239c0e11cc27e1e3d6a443a178b6
1 node_7dbd9b7c5d1e38cae04e55762d505be991751ff9896b84f8cde28106
1 node_6f8998a6993ee6de627e147f610d5194ae67175e75fd8fe8a87172e4
1 node_3bfb1ffa2fa9b4d6a8eefab6e15c1138dbc4defeb309958180740698
1 node_3dc0942946af1c1ab9305bb6a73d843f69161ff0c90aad0698962d9b
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/404.0 CPU
0.0/2000.0 Memory
0.00/1963.531 GiB memory
0.00/7.657 GiB object_store_memory
Demands:
{'CPU': 2.0, 'Memory': 10.0}: 2234+ pending tasks/actors
======== Tasks Summary: 2023-03-20 19:23:11.693436 ========
Stats:
------------------------------------
total_actor_scheduled: 0
total_actor_tasks: 0
total_tasks: 4752
Table (group by func_name):
------------------------------------
FUNC_OR_CLASS_NAME STATE_COUNTS TYPE
0 run_on_docker FAILED: 725 NORMAL_TASK
FINISHED: 1612
PENDING_NODE_ASSIGNMENT: 2244
RUNNING: 160
1 check_summary FINISHED: 9 NORMAL_TASK
RUNNING: 1
2 check_uptime PENDING_NODE_ASSIGNMENT: 1 NORMAL_TASK
jjyao, March 20, 2023, 7:48pm (#7)
This definitely looks wrong. How did you start the worker nodes? Are you using `ray start`?
Could you also show the output of `ray list nodes`?
Also I’m happy to do a quick video chat to make the debugging faster.
The worker nodes are freshly spun-up VMs that use `ray start` to connect to the head node:
ray start --disable-usage-stats --address="10.10.10.10:6379" --redis-password="****" --resources='{"Memory":"400"}' --object-store-memory 1048576000
The output of `ray list nodes` is as follows:
======== List: 2023-03-20 20:04:11.093174 ========
Stats:
------------------------------
Total: 36
Table:
------------------------------
NODE_ID NODE_IP STATE NODE_NAME RESOURCES_TOTAL
0 194de67549f11701e30f604b5f61991d56f4ce8213f092826b599baa 10.10.44.102 ALIVE 10.10.44.102 CPU: 80.0
Memory: 400.0
memory: 420437946368.0
node:10.10.44.102: 1.0
object_store_memory: 1048576000.0
1 1ba20a961e1ee756177e8d79b3f9903e5a49930d5857f840215cdef7 10.10.78.234 DEAD 10.10.78.234 CPU: 80.0
Memory: 400.0
memory: 420430684160.0
node:10.10.78.234: 1.0
object_store_memory: 1048576000.0
2 2e404ffda35765d9e8636d9fd113f7d806c9571db29c46b173aa1a3b 10.10.129.115 DEAD 10.10.129.115 CPU: 80.0
Memory: 400.0
memory: 420475068416.0
node:10.10.129.115: 1.0
object_store_memory: 1048576000.0
3 2ed83abd5b7496296e623d0e58595144928353e8bf2209ade58ce0fd 10.10.41.214 ALIVE 10.10.41.214 CPU: 80.0
Memory: 400.0
memory: 420476485632.0
node:10.10.41.214: 1.0
object_store_memory: 1048576000.0
4 2edef70fb117ef4bae491e666bafdc0d121367b652414e01de71843b 10.10.91.234 ALIVE 10.10.91.234 CPU: 80.0
Memory: 400.0
memory: 420450050048.0
node:10.10.91.234: 1.0
object_store_memory: 1048576000.0
5 37e23132b9b4e058bc249fd351e386c1db1a1bebdd95791f9042166d 10.10.139.118 DEAD 10.10.139.118 CPU: 80.0
Memory: 400.0
memory: 420495740928.0
node:10.10.139.118: 1.0
object_store_memory: 1048576000.0
6 392a3c17125b9a704037f0c196c2fe40869e7d085fd1654fd8f525f0 10.10.68.22 DEAD 10.10.68.22 CPU: 80.0
Memory: 400.0
memory: 420473323520.0
node:10.10.68.22: 1.0
object_store_memory: 1048576000.0
7 3bfb1ffa2fa9b4d6a8eefab6e15c1138dbc4defeb309958180740698 10.10.83.212 DEAD 10.10.83.212 CPU: 80.0
Memory: 400.0
memory: 420476932096.0
node:10.10.83.212: 1.0
object_store_memory: 1048576000.0
8 3dc0942946af1c1ab9305bb6a73d843f69161ff0c90aad0698962d9b 10.10.90.164 DEAD 10.10.90.164 CPU: 80.0
Memory: 400.0
memory: 420494503936.0
node:10.10.90.164: 1.0
object_store_memory: 1048576000.0
9 40c96f4ec71ceebfee1725980647983716d0c7b61cce25782de1358c 10.10.148.34 DEAD 10.10.148.34 CPU: 80.0
Memory: 400.0
memory: 420462432256.0
node:10.10.148.34: 1.0
object_store_memory: 1048576000.0
10 535ecf989c96179367a664d18c40239c0e11cc27e1e3d6a443a178b6 10.10.191.86 DEAD 10.10.191.86 CPU: 80.0
Memory: 400.0
memory: 420436226048.0
node:10.10.191.86: 1.0
object_store_memory: 1048576000.0
11 590b271f38178f25f962c0a0c4f7aeeba9e0b2a628cd20345ee1abc2 10.10.179.186 DEAD 10.10.179.186 CPU: 80.0
Memory: 400.0
memory: 420499181568.0
node:10.10.179.186: 1.0
object_store_memory: 1048576000.0
12 5b8e173b6e35645a274588bf0d164570d61e2fe81cd037d997e954ba 10.10.179.213 DEAD 10.10.179.213 CPU: 80.0
Memory: 400.0
memory: 420494204928.0
node:10.10.179.213: 1.0
object_store_memory: 1048576000.0
13 5fb36e7769df5394cfe9be2c84cd61547dd4e53529957cbcc7e7375b 10.10.75.22 DEAD 10.10.75.22 CPU: 80.0
Memory: 400.0
memory: 420484145152.0
node:10.10.75.22: 1.0
object_store_memory: 1048576000.0
14 651d1dfe3c463dd047ac60edd1d4c96c96ae49adf7860afa01ec2110 10.10.204.22 ALIVE 10.10.204.22 CPU: 80.0
Memory: 400.0
memory: 420457480192.0
node:10.10.204.22: 1.0
object_store_memory: 1048576000.0
15 6e93afeff52e8c6e1e160fa3f6f723c1279cd62b48fe9dc7876c674e 10.10.55.68 DEAD 10.10.55.68 CPU: 80.0
Memory: 400.0
memory: 420479401984.0
node:10.10.55.68: 1.0
object_store_memory: 1048576000.0
16 6f8998a6993ee6de627e147f610d5194ae67175e75fd8fe8a87172e4 10.10.10.10 ALIVE 10.10.10.10 CPU: 4.0
memory: 5956526900.0
node:10.10.10.10: 1.0
object_store_memory: 2978263449.0
17 7b32547373e1eae789f01a7037f4ca6bef42a38860c23508a8e73c22 10.10.14.11 ALIVE 10.10.14.11 CPU: 80.0
Memory: 400.0
memory: 420470149120.0
node:10.10.14.11: 1.0
object_store_memory: 1048576000.0
18 7dbd9b7c5d1e38cae04e55762d505be991751ff9896b84f8cde28106 10.10.124.217 DEAD 10.10.124.217 CPU: 80.0
Memory: 400.0
memory: 420476768256.0
node:10.10.124.217: 1.0
object_store_memory: 1048576000.0
19 86332cc9670e195f09c362018d4941f4f8f0f0821a2cade1eee471d7 10.10.207.11 DEAD 10.10.207.11 CPU: 80.0
Memory: 400.0
memory: 420494401536.0
node:10.10.207.11: 1.0
object_store_memory: 1048576000.0
20 8691f62768248964ed41b9668a3cd93237f661fbf622935ec97b6fb5 10.10.255.214 ALIVE 10.10.255.214 CPU: 80.0
Memory: 400.0
memory: 420471754752.0
node:10.10.255.214: 1.0
object_store_memory: 1048576000.0
21 95459da3a4f3204badf1dddd240b373c9fa1523195f93f8ba91a0091 10.10.243.8 ALIVE 10.10.243.8 CPU: 80.0
Memory: 400.0
memory: 420472946688.0
node:10.10.243.8: 1.0
object_store_memory: 1048576000.0
22 9d79d5eb00ddf9745b3dbc93b0c72b0dbfe4af56e07a84f6249c622b 10.10.189.12 DEAD 10.10.189.12 CPU: 80.0
Memory: 400.0
memory: 420486901760.0
node:10.10.189.12: 1.0
object_store_memory: 1048576000.0
23 a2201b214cc2c9da7ca4178046b7804e77716610c0bfd750a1232d43 10.10.141.198 DEAD 10.10.141.198 CPU: 80.0
Memory: 400.0
memory: 420456001536.0
node:10.10.141.198: 1.0
object_store_memory: 1048576000.0
24 ab6ac53d1036bf48629094008d77b9567f94fcf17406762340046e01 10.10.82.247 DEAD 10.10.82.247 CPU: 80.0
Memory: 400.0
memory: 420494905344.0
node:10.10.82.247: 1.0
object_store_memory: 1048576000.0
25 abf5c4ec83fa410909232700c6aebf79565de9db5f1a301ea3fee532 10.10.6.246 DEAD 10.10.6.246 CPU: 80.0
Memory: 400.0
memory: 420464316416.0
node:10.10.6.246: 1.0
object_store_memory: 1048576000.0
26 adf9a44c4b30d0da3b3f6ce6205300387139852d1cac38e31ca90b5b 10.10.245.208 DEAD 10.10.245.208 CPU: 80.0
Memory: 400.0
memory: 420472823808.0
node:10.10.245.208: 1.0
object_store_memory: 1048576000.0
27 ba06e0f430bffc2f4b1ccf050cc4e1f76a4adebc438a2fbc0b0420f5 10.10.39.95 DEAD 10.10.39.95 CPU: 80.0
Memory: 400.0
memory: 420504715264.0
node:10.10.39.95: 1.0
object_store_memory: 1048576000.0
28 bf1e7380dc822cfb3be95c9b60deafeb19571b1a672c7f93b544a05c 10.10.177.103 DEAD 10.10.177.103 CPU: 80.0
Memory: 400.0
memory: 420397867008.0
node:10.10.177.103: 1.0
object_store_memory: 1048576000.0
29 c55bc798e65f113da8dfadc4326a00444b7a035782f618a2a539d4d7 10.10.254.21 DEAD 10.10.254.21 CPU: 80.0
Memory: 400.0
memory: 420462739456.0
node:10.10.254.21: 1.0
object_store_memory: 1048576000.0
30 daf2cfba1cbbcf6042b8384f1e8783594e23efb909ac80a6d5db2917 10.10.159.183 DEAD 10.10.159.183 CPU: 80.0
Memory: 400.0
memory: 420472111104.0
node:10.10.159.183: 1.0
object_store_memory: 1048576000.0
31 e1aa43ae23d9bd6d54ec63b122814a48bdacfb340c707cd5693360f4 10.10.241.81 ALIVE 10.10.241.81 CPU: 80.0
Memory: 400.0
memory: 420448931840.0
node:10.10.241.81: 1.0
object_store_memory: 1048576000.0
32 e8e82b708ea09b5cdf0a57a0e15b7dc78a9fda12c535d9285678ae14 10.10.48.207 ALIVE 10.10.48.207 CPU: 80.0
Memory: 400.0
memory: 420466184192.0
node:10.10.48.207: 1.0
object_store_memory: 1048576000.0
33 ea3237fc97a3ce2a1ad3dd4592671cbcf8e404bae2b938cf589bb353 10.10.185.220 DEAD 10.10.185.220 CPU: 80.0
Memory: 400.0
memory: 420465213440.0
node:10.10.185.220: 1.0
object_store_memory: 1048576000.0
34 f049191cd01415da37a663414a282c57d97c1c549c9d2f5372a16dbe 10.10.111.105 DEAD 10.10.111.105 CPU: 2.0
Memory: 20.0
memory: 19528146944.0
node:10.10.111.105: 1.0
object_store_memory: 1048576000.0
35 f69bc8947e4946454297294f90c9f5b5e439324333065729cdd4c52c 10.10.69.250 DEAD 10.10.69.250 CPU: 80.0
Memory: 400.0
memory: 420463702016.0
node:10.10.69.250: 1.0
object_store_memory: 1048576000.0
Happy to chat! I'll reach out to you on the Ray Slack channel.
It seems like I have run into this problem too. May I ask how this problem was eventually solved?
Based on our discussion, one possible reason is that the caller script (the one doing `ray.get([...])` over thousands of tasks) runs on a remote client; when that client disconnects, Ray's retry mechanism malfunctions for tasks whose workers were killed. Changing the script so that this `ray.get` happens on the head node might solve the issue. (However, I haven't tested this in our production environment yet.)
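To make the suggested change concrete, here is a minimal sketch (assumptions: the script currently connects through Ray Client on the default port 10001, and the `max_retries` value and input list are illustrative, not taken from this thread):

```python
import ray

# Before (assumed): the driver ran on a remote machine through Ray Client, e.g.
#   ray.init("ray://10.10.10.10:10001")
# After: run the same script directly on the head node, so the driver and
# Ray's task-retry bookkeeping do not depend on a client connection staying up.
ray.init(address="auto")

@ray.remote(num_cpus=2, resources={"Memory": 10}, max_retries=3)
def run_on_docker(job_input):
    ...  # unchanged task body
    return job_input

inputs = range(5000)  # placeholder for the real job inputs
results = ray.get([run_on_docker.remote(i) for i in inputs])
```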