How to diagnose an issue where Ray stops working but no error is raised?

Rarely, after running a distributed job (e.g., training N torch models on N nodes or processing data in parallel), Ray just stops working and hangs instead of returning. AFAIK, it never happens mid-run, when resource usage is very high; it always seems to happen after the last worker has finished its work. My uneducated guess is that Ray is somehow not cleaning up correctly.

What’s the best way to diagnose what’s happening? If it helps, I’m using Dataproc, Ubuntu 18.04, and Ray 1.8.

Also, is there a workaround? Being able to retry remote functions is great, but that won’t work if the process just hangs forever.
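
For context, here’s a rough sketch of the kind of workaround I’ve been considering, assuming the hang happens while the driver is blocked waiting for results: put a timeout on `ray.get` so it raises instead of blocking forever. `train_one_model` is just a placeholder for the real per-node work, and the timeout value is arbitrary:

```python
import time
import ray
from ray.exceptions import GetTimeoutError

ray.init(address="auto")  # connect to the running cluster (or ray.init() for a local test)

# Placeholder for the real per-node training / data-processing task.
@ray.remote
def train_one_model(shard_id: int) -> int:
    time.sleep(1)
    return shard_id

refs = [train_one_model.remote(i) for i in range(4)]

# Bound how long the driver blocks: ray.get raises GetTimeoutError instead of
# hanging forever, so the script can at least log what finished and bail out.
try:
    results = ray.get(refs, timeout=600)  # seconds; pick whatever budget fits the job
except GetTimeoutError:
    ready, pending = ray.wait(refs, num_returns=len(refs), timeout=0)
    print(f"timed out: {len(ready)} tasks done, {len(pending)} still pending")
finally:
    ray.shutdown()
```

That only helps if the driver is stuck waiting on tasks, though, not if the hang happens during shutdown itself.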

Thanks.

Is it the main driver script that’s not returning after the job has finished? One thing you can try is dumping the driver’s stack with `py-spy dump --native`, or following the profiling guide in the Ray docs (Profiling (internal) — Ray v1.9.2).
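
Roughly, assuming the driver is the process that’s stuck, the steps look like the sketch below; logging the PID in your script is just a convenience so you know which process to attach to, and attaching usually needs sudo:

```python
import os

# In the driver script, log the PID so you know which process to attach to.
print(f"driver pid: {os.getpid()}")

# Then, from a shell on the node where the driver is hung:
#
#   pip install py-spy
#   sudo py-spy dump --pid <driver_pid> --native
#
# `py-spy dump` prints the current Python stack of every thread, and --native
# adds the C/C++ frames, which should show exactly where the driver is stuck.
```

If the driver looks fine, you can run the same dump against the raylet or worker PIDs listed in the logs under /tmp/ray/session_latest/logs to see whether one of them is wedged instead.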