Ray job is stuck when the node a worker runs on is killed

Hmm, it’s possible the ray.nodes() discrepancy is there because the cluster state takes some time to converge after a node is killed.
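
If you want to check whether the view really does settle, here is a minimal sketch that polls the node list until the set of alive nodes stops changing. The helper names (alive_node_ids, wait_for_stable_nodes) and the polling parameters are my own, not part of Ray; the only Ray behavior it assumes is that ray.nodes() returns a list of dicts with "NodeID" and "Alive" keys.

```python
import time
from typing import Callable, Dict, List, Set


def alive_node_ids(nodes: List[Dict]) -> Set[str]:
    """Return the IDs of the nodes reported as alive in one ray.nodes() snapshot."""
    return {n["NodeID"] for n in nodes if n.get("Alive")}


def wait_for_stable_nodes(
    get_nodes: Callable[[], List[Dict]],
    stable_checks: int = 3,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Poll get_nodes() until the alive set is identical for
    `stable_checks` consecutive polls, then return that set.

    Hypothetical helper: the thresholds are arbitrary defaults, not Ray settings.
    """
    deadline = time.monotonic() + timeout_s
    previous = alive_node_ids(get_nodes())
    unchanged = 1  # the first snapshot counts as one observation
    while unchanged < stable_checks:
        if time.monotonic() >= deadline:
            raise TimeoutError("cluster view did not stabilize in time")
        sleep(interval_s)
        current = alive_node_ids(get_nodes())
        # Reset the counter whenever the alive set changes.
        unchanged = unchanged + 1 if current == previous else 1
        previous = current
    return previous


def main() -> None:
    # Imported lazily so the helpers above can be used without Ray installed.
    import ray

    ray.init(address="auto")  # assumes a cluster is already running
    print(wait_for_stable_nodes(ray.nodes))
```

If the alive set never settles, or settles on a set that still includes the killed node, that would suggest the discrepancy is not just a convergence delay.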

If you can reproduce the issue right now, you could try checking where your application is getting stuck. This docs page on debugging might be useful to look at. Here are some relevant tools you can try:

  • The ray memory CLI command will tell you which ObjectRefs are currently in scope and which are still pending execution.
  • Setting the OS environment variable RAY_record_ref_creation_sites=1 when starting Ray will add more information to the ray memory output about which tasks created which ObjectRefs.
  • The ray stack CLI command will tell you where in Python the current processes are executing, including the application driver and any task workers.
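
To grab both outputs in one go while the job is stuck, something like the sketch below might help. The function names (env_with_ref_sites, run_ray_cli, collect_diagnostics) are hypothetical helpers I made up; the only Ray pieces assumed are the ray memory and ray stack commands and the RAY_record_ref_creation_sites variable mentioned above. As far as I know, the variable needs to be set in the environment before Ray and the driver start, and ray stack only covers processes on the machine where you run it and relies on py-spy being installed.

```python
import os
import subprocess
from typing import Callable, Dict, Mapping, Optional


def env_with_ref_sites(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with ObjectRef creation-site
    recording enabled. The input mapping is not modified."""
    env = dict(os.environ if base is None else base)
    env["RAY_record_ref_creation_sites"] = "1"
    return env


def run_ray_cli(subcommand: str) -> str:
    """Run `ray <subcommand>` and return its combined stdout and stderr."""
    result = subprocess.run(
        ["ray", subcommand], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr


def collect_diagnostics(
    runner: Callable[[str], str] = run_ray_cli,
) -> Dict[str, str]:
    """Collect the output of `ray memory` and `ray stack`, keyed by subcommand."""
    return {name: runner(name) for name in ("memory", "stack")}
```

For example, you could launch your driver with subprocess.run(["python", "my_driver.py"], env=env_with_ref_sites()), where my_driver.py stands in for your own script, and then call collect_diagnostics() on the affected node once the job hangs.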