Hmm, it’s possible the ray.nodes() discrepancy shows up because the cluster state takes some time to converge.
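If convergence is the cause, one way to check is to poll ray.nodes() until the node count you expect shows up as alive. Here is a rough sketch, not an official Ray utility: `wait_for_alive_nodes`, `expected`, `list_nodes`, and the timeout values are names I made up for illustration, and it assumes each entry returned by ray.nodes() is a dict with an "Alive" field.

```python
import time
from typing import Callable, Dict, List, Optional


def wait_for_alive_nodes(
    expected: int,
    list_nodes: Optional[Callable[[], List[Dict]]] = None,
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
) -> int:
    """Poll until at least `expected` nodes report Alive, else raise TimeoutError.

    `list_nodes` is a hypothetical hook so the loop can be exercised without a
    cluster; when omitted it falls back to ray.nodes.
    """
    if list_nodes is None:
        import ray  # imported lazily; assumes ray.init() has already been called

        list_nodes = ray.nodes

    deadline = time.monotonic() + timeout_s
    while True:
        # Count only the nodes the cluster currently reports as alive.
        alive = sum(1 for node in list_nodes() if node.get("Alive"))
        if alive >= expected:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {alive} of {expected} nodes alive after {timeout_s}s"
            )
        time.sleep(poll_interval_s)
```

Calling `wait_for_alive_nodes(expected=4)` right after `ray.init(address="auto")` would block until four nodes are alive. If it times out, the discrepancy is probably not just a convergence delay.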
If you can reproduce the issue right now, you could try checking where your application is getting stuck. This docs page on debugging might be useful to look at. Here are some relevant tools you can try:
- `ray memory` CLI will tell you which ObjectRefs are currently in scope and which are still pending execution.
- Passing the OS environment variable `RAY_record_ref_creation_sites=1` to Ray will provide more information in the above output about which tasks created which ObjectRefs.
- `ray stack` CLI will tell you where in Python the current processes are, including the application driver and any task workers.
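If you'd rather drive these tools from a script than type them by hand, something like the sketch below might work. The helper names `ref_tracking_env` and `run_ray_cli` are my own, not part of Ray, and it assumes the `ray` executable is on your PATH and that the environment variable is set before the Ray processes start.

```python
import os
import subprocess
from typing import Dict, Optional


def ref_tracking_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with ref creation-site recording on.

    Hypothetical helper: pass the result as `env=` when launching the driver
    so the variable is in place before Ray starts.
    """
    env = dict(os.environ if base is None else base)
    env["RAY_record_ref_creation_sites"] = "1"
    return env


def run_ray_cli(subcommand: str) -> str:
    """Run `ray <subcommand>` (for example "memory" or "stack"); return stdout.

    Assumes the `ray` CLI is installed and a cluster is reachable from this
    machine; raises CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        ["ray", subcommand], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, you could start your driver with `subprocess.Popen([sys.executable, "my_driver.py"], env=ref_tracking_env())` (where `my_driver.py` is a placeholder for your script), and once it appears stuck, print `run_ray_cli("memory")` and `run_ray_cli("stack")` from a second process on the same machine.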