How to diagnose an issue where Ray stops working but no error is raised?

Rarely, after running a distributed job (e.g., training N torch models on N nodes or processing data in parallel), Ray just stops working and hangs instead of returning. AFAIK, it never happens mid-run, when resource usage is very high; it always seems to happen after the last worker has finished its work. My uneducated guess is that Ray is somehow not cleaning up correctly.

What’s the best way to diagnose what’s happening? If it helps, I’m using Dataproc, Ubuntu 18.04, and Ray 1.8.

Also, is there a workaround? Being able to retry remote functions is great, but that won’t work if the process just hangs forever.
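
For context, here’s a rough sketch of the kind of workaround I’ve been considering, assuming the hang happens while the driver is blocked waiting for results: put a timeout on `ray.get` so it raises instead of blocking forever. `train_one_model` is just a placeholder for the real per-node work, and the timeout value is arbitrary:

```python
import time
import ray
from ray.exceptions import GetTimeoutError

ray.init(address="auto")  # connect to the running cluster (or ray.init() for a local test)

# Placeholder for the real per-node training / data-processing task.
@ray.remote
def train_one_model(shard_id: int) -> int:
    time.sleep(1)
    return shard_id

refs = [train_one_model.remote(i) for i in range(4)]

# Bound how long the driver blocks: ray.get raises GetTimeoutError instead of
# hanging forever, so the script can at least log what finished and bail out.
try:
    results = ray.get(refs, timeout=600)  # seconds; pick whatever budget fits the job
except GetTimeoutError:
    ready, pending = ray.wait(refs, num_returns=len(refs), timeout=0)
    print(f"timed out: {len(ready)} tasks done, {len(pending)} still pending")
finally:
    ray.shutdown()
```

That only helps if the driver is stuck waiting on tasks, though, not if the hang happens during shutdown itself.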

Thanks.

Is it the main driver script that’s not returning after the job has finished? One thing you can try is dumping the driver’s stack with `py-spy dump --native`, or following the profiling guide in the Ray docs (Profiling (internal) — Ray v1.9.2).
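
Roughly, assuming the driver is the process that’s stuck, the steps look like the sketch below; logging the PID in your script is just a convenience so you know which process to attach to, and attaching usually needs sudo:

```python
import os

# In the driver script, log the PID so you know which process to attach to.
print(f"driver pid: {os.getpid()}")

# Then, from a shell on the node where the driver is hung:
#
#   pip install py-spy
#   sudo py-spy dump --pid <driver_pid> --native
#
# `py-spy dump` prints the current Python stack of every thread, and --native
# adds the C/C++ frames, which should show exactly where the driver is stuck.
```

If the driver looks fine, you can run the same dump against the raylet or worker PIDs listed in the logs under /tmp/ray/session_latest/logs to see whether one of them is wedged instead.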