Ray job is stuck when node worker runs on is killed

Stephanie_Wang · July 1, 2022, 7:31pm

Hmm, it’s possible the ray.nodes() discrepancy is there because the cluster state takes some time to converge.
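One way to check whether the discrepancy is just convergence lag is to poll ray.nodes() until the set of alive nodes stops changing. Here is a minimal sketch of that idea. The helper names (`alive_node_ids`, `wait_for_stable_nodes`, `main`) and the polling parameters are made up for illustration; the only Ray calls assumed are `ray.init(address="auto")` and `ray.nodes()`, whose entries carry `"NodeID"` and `"Alive"` fields.

```python
import time
from typing import Callable, List, Optional


def alive_node_ids(nodes: List[dict]) -> List[str]:
    """Return the sorted IDs of the nodes reported as alive."""
    return sorted(n["NodeID"] for n in nodes if n.get("Alive"))


def wait_for_stable_nodes(
    fetch_nodes: Callable[[], List[dict]],
    polls: int = 3,            # hypothetical: identical reads required in a row
    interval_s: float = 1.0,   # hypothetical: delay between reads
    max_polls: int = 30,       # hypothetical: give up after this many reads
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[List[str]]:
    """Poll until the alive-node set is unchanged for `polls` consecutive reads.

    Returns the stable list of node IDs, or None if the view never settled.
    """
    last: Optional[List[str]] = None
    stable = 0
    for _ in range(max_polls):
        current = alive_node_ids(fetch_nodes())
        if current == last:
            stable += 1
        else:
            last, stable = current, 1
        if stable >= polls:
            return current
        sleep(interval_s)
    return None


def main() -> None:
    # Requires a running cluster; not called automatically.
    import ray

    ray.init(address="auto")
    print(wait_for_stable_nodes(ray.nodes))
```

If this returns None, or settles on a set that still includes the killed node after a long wait, the discrepancy is probably not just a timing issue.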

If you can reproduce the issue right now, you could try checking where your application is getting stuck. This docs page on debugging might be useful to look at. Here are some relevant tools you can try:

- The `ray memory` CLI will tell you which ObjectRefs are currently in scope and which are still pending execution.
  - Passing the OS environment variable RAY_record_ref_creation_sites=1 to Ray will add information to the `ray memory` output about which tasks created which ObjectRefs.
- The `ray stack` CLI will tell you where in Python the current processes are, including the application driver and any task workers.
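If you want to script these checks, something along these lines might work. It is only a sketch: `debug_env`, `run_driver`, and `run_ray_cli` are hypothetical helper names, and it assumes the `ray` CLI is on the PATH of a node in the cluster. My understanding is that RAY_record_ref_creation_sites has to be set before Ray starts (for both the cluster processes and the driver), so setting it only for the driver here may not be enough on an already running cluster.

```python
import os
import subprocess
import sys
from typing import Dict, Optional


def debug_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with ref creation-site recording on."""
    env = dict(os.environ if base is None else base)
    env["RAY_record_ref_creation_sites"] = "1"
    return env


def run_driver(script_path: str) -> int:
    """Run the application driver with the debug environment (hypothetical helper)."""
    return subprocess.run([sys.executable, script_path], env=debug_env()).returncode


def run_ray_cli(subcommand: str) -> str:
    """Run `ray memory` or `ray stack` and return whatever it printed."""
    result = subprocess.run(
        ["ray", subcommand], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

While the job is stuck, calling `run_ray_cli("memory")` and `run_ray_cli("stack")` from another shell on the head node should capture the same output you would get from running the commands by hand.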
