Hello,
I just upgraded from 1.8.0 to 1.11.0. In my distributed application, I have chunks of work where I call ray.init once, call ray.get a few times, and call ray.shutdown at the end. The tasks behind my ray.get calls always write data to cloud storage, so I can tell whether they've run.
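For context, each chunk looks roughly like the sketch below; the function is a placeholder, not my real code:

```python
import ray

ray.init()  # in the real app this connects to the cluster

@ray.remote
def do_chunk(i):
    # stand-in for my real task, which writes its output to cloud storage
    # (that's how I can tell whether it actually ran)
    return f"chunk {i} done"

# a few separate ray.get calls over the life of the driver
ray.get([do_chunk.remote(i) for i in range(4)])
ray.get([do_chunk.remote(i) for i in range(4)])

ray.shutdown()
```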
Since switching to 1.11.0, in many places Ray gets stuck between certain ray.get calls and does nothing. The behavior is deterministic (i.e., it always hangs between the same pairs of gets).
In one case, I always get a single error that a remote function isn't registered, and all of the workers except one write data. In the other cases, there is no warning, and no work gets done.
In every case, restarting Ray between the get calls worked around the issue. Is there anything else I can try?
Thanks.
Hi @hahdawg, is there an easy way to reproduce the issue?
I was hoping there would be an easy solution. I can try to reproduce it in a script.
If this is not easy to reproduce, that's fine. We are making stability improvements in the related code path, and hopefully this issue will be resolved as well. By the way, are you using Ray Datasets or pyarrow?
I use Ray Datasets in only one function. I use pyarrow everywhere via pandas and Dask.
If it helps, the code also hangs in some places where I'm not using any of the above. For example, on N nodes, I issue three separate ray.get calls to:
- Copy torch models and batch data from cloud storage to each node.
- Make predictions on each node and save them to cloud storage.
- Delete models from each node.
Ray hangs between steps 2 and 3 unless I restart it.
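Roughly, that part of the code looks like the sketch below; the function names and bodies are placeholders for the real steps:

```python
import ray

ray.init()

@ray.remote
def copy_model(node_id):
    # step 1: stand-in for copying a torch model and batch data to a node
    return node_id

@ray.remote
def predict(node_id):
    # step 2: stand-in for making predictions and saving them to cloud storage
    return node_id

@ray.remote
def delete_model(node_id):
    # step 3: stand-in for deleting the model copies from a node
    return node_id

nodes = range(4)
ray.get([copy_model.remote(n) for n in nodes])    # step 1 completes fine
ray.get([predict.remote(n) for n in nodes])       # step 2 completes fine

# without this restart, the next ray.get hangs on 1.11.0
ray.shutdown()
ray.init()

ray.get([delete_model.remote(n) for n in nodes])  # step 3
ray.shutdown()
```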
Ok, will update here after we merge some changes that may help.