Workers freeze with bug in the distributed reference counting protocol

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

After upgrading to Ray 1.12.0, one or more of my workers freeze and leave zombie processes behind. The following line appears many times, with different object refs, in python-core-worker-<hash>.log:

[2022-04-20 19:25:41,635 W 55793 55824] Object locations requested for 0046b8beecd25f3a298950a29812c5e5f5ce2b7b0100000002080000, but ref already removed. This may be a bug in the distributed reference counting protocol.

This happens at a different stage in each run and is very unpredictable, which makes it tricky to debug. Judging by the error message, this is already acknowledged as a possible bug. I haven’t found it raised anywhere else. Am I the first to surface an example?

I don’t have the bandwidth to provide a nicely packaged repro right now, but I thought I would raise this in case anyone has thoughts.

  • python version 3.7.9
  • ray version 1.12.0
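
In case it helps with triage, here is a minimal sketch of how the warning quoted above could be tallied per object ref across the worker logs. The function names are made up for illustration, and the regular expression assumes the warning wording is exactly as it appears in the log line in this post.

```python
import re
from collections import Counter
from pathlib import Path

# Assumes the warning text matches the line quoted in this thread.
WARNING_RE = re.compile(
    r"Object locations requested for ([0-9a-f]+), but ref already removed"
)


def count_removed_refs(lines):
    """Count how often each object ref appears in the warning (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_log_dir(log_dir):
    """Tally the warning across every python-core-worker-*.log file in log_dir."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob("python-core-worker-*.log")):
        with path.open(errors="replace") as handle:
            counts.update(count_removed_refs(handle))
    return counts
```

Calling scan_log_dir("/tmp/ray/session_latest/logs") and printing the most common entries would show whether a handful of refs dominate. That path is an assumption about where the session logs live on the affected node and may differ in your setup.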

cc @Stephanie_Wang

It’d be great if you can provide a repro so we can debug it.

This definitely seems like a regression! It would be great if you could file an issue on GitHub with more information. A repro would be ideal, but otherwise more detail about your workload would help.

Also, it would be helpful to see the output of ray memory after you rerun this with the environment variable RAY_record_ref_creation_sites=1 set.
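
One possible way to do that from Python is sketched below. The helper names and the workload command are hypothetical. The sketch assumes the ray CLI is on PATH and that the variable must be in the environment before the driver starts Ray; on a cluster started with ray start, it likely needs to be set for that command as well.

```python
import os
import subprocess


def env_with_ref_creation_sites(base_env=None):
    """Return a copy of the environment with ref creation site recording enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_record_ref_creation_sites"] = "1"
    return env


def launch_workload(command):
    """Start the workload (hypothetical command list) with the variable set."""
    return subprocess.Popen(command, env=env_with_ref_creation_sites())


def dump_ray_memory():
    """Run `ray memory` against the running cluster and return its text output."""
    result = subprocess.run(
        ["ray", "memory"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr


# Example usage (my_workload.py is a placeholder for the real driver script):
#   proc = launch_workload(["python", "my_workload.py"])
#   ... wait until the workers freeze, then from the same node:
#   print(dump_ray_memory())
```

ray memory is run while the workload is still up because it reports on a live cluster, so the output should be captured once the freeze occurs rather than after the driver exits.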