How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Since upgrading to Ray 1.12.0, one or more of my workers freeze with zombie processes, and the following line (repeated many times with different object refs) appears in python-core-worker-<hash>.log:
[2022-04-20 19:25:41,635 W 55793 55824] reference_count.cc:1413: Object locations requested for 0046b8beecd25f3a298950a29812c5e5f5ce2b7b0100000002080000, but ref already removed. This may be a bug in the distributed reference counting protocol.
This happens at a different stage in each run and is very unpredictable, which makes it tricky to debug. Judging by the error message, this is already acknowledged as a possible bug. I haven't found it raised anywhere else. Am I the first to surface an example?
I don't have the bandwidth to provide a nicely packaged repro right now, but I thought I would put this out there in case anyone has thoughts.
- Python version: 3.7.9
- Ray version: 1.12.0
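
In case it helps with triage, here is a minimal sketch of how the warning quoted above could be counted per worker log. It only parses the log text and does not call any Ray API. The function name `count_removed_ref_warnings` is hypothetical, and the log directory passed to it is an assumption (on my setup the session logs are under `/tmp/ray/session_latest/logs`; yours may differ).

```python
import re
from collections import Counter
from pathlib import Path

# Matches the warning text quoted above and captures the object ref (hex ID).
WARNING_RE = re.compile(
    r"Object locations requested for ([0-9a-f]+), but ref already removed"
)


def count_removed_ref_warnings(log_dir):
    """Return {log file name: Counter of object refs} for matching warnings.

    Hypothetical helper: scans python-core-worker-*.log files in log_dir.
    """
    results = {}
    for path in sorted(Path(log_dir).glob("python-core-worker-*.log")):
        refs = Counter()
        # errors="replace" avoids failing on any non-UTF-8 bytes in the log.
        with path.open(errors="replace") as fh:
            for line in fh:
                match = WARNING_RE.search(line)
                if match:
                    refs[match.group(1)] += 1
        if refs:
            results[path.name] = refs
    return results
```

Calling `count_removed_ref_warnings("/tmp/ray/session_latest/logs")` (path assumed, adjust as needed) gives a per-file count of the warning, which should show whether the frozen workers are the ones logging it and whether the same object refs recur.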