Hello,
I just upgraded from 1.8.0 to 1.11.0. In my distributed application, I have chunks of work where I call ray.init once, call ray.get a few times, and call ray.shutdown at the end. The tasks behind my ray.get calls always write data to cloud storage, so I can tell whether they've run.
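For context, each chunk looks roughly like the sketch below; the function is a placeholder, not my real code:

```python
import ray

ray.init()  # in the real app this connects to the cluster

@ray.remote
def do_chunk(i):
    # stand-in for my real task, which writes its output to cloud storage
    # (that's how I can tell whether it actually ran)
    return f"chunk {i} done"

# a few separate ray.get calls over the life of the driver
ray.get([do_chunk.remote(i) for i in range(4)])
ray.get([do_chunk.remote(i) for i in range(4)])

ray.shutdown()
```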
Since switching to 1.11.0, in many places Ray gets stuck between certain ray.get calls and does nothing. The behavior is deterministic (i.e., it always hangs between the same pairs of gets).
In one case, I always get a single error that a remote function isn't registered, and all of the workers except one write data. In the other cases, there is no warning, and no work gets done.
In every case, restarting Ray between the get calls worked around the issue. Is there anything else I can try?
Thanks.
Hi @hahdawg, is there an easy way to reproduce the issue?
I was hoping there would be an easy solution. I can try to reproduce it in a script.
If this is not easy to reproduce, that's fine. We are making stability improvements in the related code path, and hopefully this issue will be resolved as well. By the way, are you using Ray Datasets or pyarrow?
I use Ray Datasets in only one function. I use pyarrow everywhere via pandas and Dask.
If it helps, the code also hangs in some places where I'm not using any of the above. For example, on N nodes, I issue three separate ray.get calls to:
- Copy torch models and batch data from cloud storage to each node.
- Make predictions on each node and save them to cloud storage.
- Delete models from each node.
Ray hangs between steps 2 and 3 unless I restart it.
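Roughly, that part of the code looks like the sketch below; the function names and bodies are placeholders for the real steps:

```python
import ray

ray.init()

@ray.remote
def copy_model(node_id):
    # step 1: stand-in for copying a torch model and batch data to a node
    return node_id

@ray.remote
def predict(node_id):
    # step 2: stand-in for making predictions and saving them to cloud storage
    return node_id

@ray.remote
def delete_model(node_id):
    # step 3: stand-in for deleting the model copies from a node
    return node_id

nodes = range(4)
ray.get([copy_model.remote(n) for n in nodes])    # step 1 completes fine
ray.get([predict.remote(n) for n in nodes])       # step 2 completes fine

# without this restart, the next ray.get hangs on 1.11.0
ray.shutdown()
ray.init()

ray.get([delete_model.remote(n) for n in nodes])  # step 3
ray.shutdown()
```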
Ok, will update here after we merge some changes that may help.