Workers freeze with bug in the distributed reference counting protocol

bpsmith · April 26, 2022, 8:21pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

After upgrading to Ray 1.12.0, I am now seeing one or more of my workers freeze, leaving zombie processes, with the following line (repeated many times with different object refs) in python-core-worker-<hash>.log:

[2022-04-20 19:25:41,635 W 55793 55824] reference_count.cc:1413: Object locations requested for 0046b8beecd25f3a298950a29812c5e5f5ce2b7b0100000002080000, but ref already removed. This may be a bug in the distributed reference counting protocol.

This happens at a different stage in each run, which makes it unpredictable and tricky to debug. Judging by the error message, this is already acknowledged as a possible bug. I haven’t found it raised anywhere else. Am I the first to surface an example?

I don’t necessarily have the bandwidth to provide a nicely packaged repro right now, but I thought I would throw this out there in case anyone had thoughts.

python version 3.7.9
ray version 1.12.0
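
For anyone triaging the same symptom, it can help to tally how often the warning appears and how many distinct object refs are involved. Below is a minimal sketch, not an official Ray tool. The regular expression is modeled only on the single log line quoted above, so other Ray versions may need a different pattern, and the helper names (tally_removed_ref_warnings, tally_log_dir) are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Pattern modeled on the one warning line quoted in the post (assumption:
# other Ray versions may word or format this message differently).
WARNING_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+?) W (?P<pid>\d+) (?P<tid>\d+)\] "
    r"reference_count\.cc:\d+: Object locations requested for "
    r"(?P<object_id>[0-9a-fA-F]+), but ref already removed"
)


def tally_removed_ref_warnings(lines):
    """Count warning occurrences per object ID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group("object_id")] += 1
    return counts


def tally_log_dir(log_dir):
    """Tally warnings across every python-core-worker-*.log file in log_dir."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob("python-core-worker-*.log")):
        # errors="replace" keeps the scan going if a log has stray bytes.
        with path.open(errors="replace") as handle:
            counts.update(tally_removed_ref_warnings(handle))
    return counts
```

A possible use is tally_log_dir("/tmp/ray/session_latest/logs").most_common(10), assuming the logs sit in Ray's default session log directory; adjust the path if the cluster was started with a different temp dir.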

jjyao · April 27, 2022, 9:57pm

cc @Stephanie_Wang

It’d be great if you could provide a repro so we can debug it.

Stephanie_Wang · April 27, 2022, 10:02pm

Definitely seems like a regression! It would be great if you could file an issue on GitHub with more info. A repro would be ideal; otherwise, more information about your workload would still help.

Also, it would be helpful to see the output of ray memory if you rerun this with the OS environment variable RAY_record_ref_creation_sites=1.
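
One way to follow this suggestion from a script is sketched below. It is only a sketch under some assumptions: the ray CLI must be on the PATH, a Ray cluster must be running when ray memory is called, and the helper names (env_with_ref_creation_sites, start_workload, ray_memory_report) and the workload.py driver mentioned afterwards are hypothetical. If the cluster is started separately with ray start, the variable probably needs to be set in that process's environment as well.

```python
import os
import subprocess
import sys


def env_with_ref_creation_sites(base_env=None):
    """Return a copy of the environment with RAY_record_ref_creation_sites=1."""
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_record_ref_creation_sites"] = "1"
    return env


def start_workload(script_path, env):
    """Launch the driver script in the background with the given environment."""
    return subprocess.Popen([sys.executable, script_path], env=env)


def ray_memory_report(env=None):
    """Run the `ray memory` CLI and return its text output.

    Raises CalledProcessError if the command fails, for example when no
    Ray cluster is reachable.
    """
    result = subprocess.run(
        ["ray", "memory"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, call env = env_with_ref_creation_sites(), then start_workload("workload.py", env), and once workers begin to freeze, save the string returned by ray_memory_report(env) to attach to the GitHub issue.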
