How severely does this issue affect your experience of using Ray?
High: it blocks me from completing my task.
Sometimes a call to a remote actor function blocks without ever returning. As I understand it, a remote call is supposed to return immediately.
I’ve tried looking at all the logs in verbose mode and can’t find anything (I don’t see the actor function starting in the worker).
A little bit about my client.
I use the Ray client from a process outside the Ray cluster and call it from multiple threads in that process.
I am new to Ray, and I would appreciate any help in understanding which logs will show what the client is actually doing.
I’ve upgraded to 1.12.0 and this is the call stack that I get:
Thread 17484 (idle): “slice processing_0”
_async_send (ray/util/client/dataclient.py:426)
ReleaseObject (ray/util/client/dataclient.py:531)
_release_server (ray/util/client/worker.py:628)
call_release (ray/util/client/worker.py:622)
__del__ (ray/util/client/common.py:110)
put (queue.py:132)
_async_send (ray/util/client/dataclient.py:432)
Schedule (ray/util/client/dataclient.py:535)
_call_schedule_for_task (ray/util/client/worker.py:592)
call_remote (ray/util/client/worker.py:550)
call_remote (ray/util/client/api.py:109)
remote (ray/util/client/common.py:540)
(detection_scan_server/processing_manager.py:116)
_run_defects_creation (detection_scan_server/processing_manager.py:116)
_thread_process_slice (detection_scan_server/processing_manager.py:57)
run (concurrent/futures/thread.py:57)
_worker (concurrent/futures/thread.py:80)
run (threading.py:870)
_bootstrap_inner (threading.py:932)
_bootstrap (threading.py:890)
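Reading the trace bottom-up: `Schedule` enters `_async_send` and is inside `queue.put` when garbage collection runs an object’s `__del__`, which re-enters `_async_send` via `ReleaseObject` on the same thread. If the outer call holds a non-reentrant lock, the inner call can block forever. Here is a minimal pure-Python sketch of that pattern (all names are illustrative stand-ins, not Ray’s actual internals):

```python
import threading

lock = threading.Lock()  # stand-in for the data channel's internal lock
events = []

class TrackedRef:
    """Stand-in for a client object reference that releases itself on GC."""
    def __del__(self):
        # Analogous to the __del__ -> call_release -> ReleaseObject path:
        # this runs at an arbitrary point, possibly while the same thread
        # already holds `lock`.
        if lock.acquire(blocking=False):
            events.append("released cleanly")
            lock.release()
        else:
            # A *blocking* acquire here would hang the thread forever.
            events.append("lock already held: blocking here would deadlock")

def schedule():
    """Stand-in for Schedule -> _async_send, which holds the lock while sending."""
    with lock:
        ref = TrackedRef()
        del ref  # last reference dies inside the critical section; __del__ runs now

schedule()
print(events)  # -> ['lock already held: blocking here would deadlock']
```

The sketch uses a non-blocking acquire only so it terminates and can show the problem; the real code path presumably blocks, which would match the idle thread stuck in `put (queue.py:132)` above.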
To me it looks like the issue is in the nested send: a `__del__` fires while `Schedule` is already inside `_async_send`, and the release call re-enters the same data channel.
I just call actor remote functions in a loop and it happens.
So I don’t understand what I am doing wrong, since it would seem like everyone should be suffering from this bug.
@shiranbi yes, this is a known problem, for which we don’t have a good fix on Python 3.6. Are you using Python 3.7 or above? If so, we can try to merge in a mitigation for it.
Hi @Mingwei,
Yes, I am working with Python 3.8 and trying to migrate to Python 3.9, but that might still be a while.
I would really appreciate anything that could help; I had to disable my entire project since this happens so often.
I have no problem trying a nightly release. I tried running with a nightly a few days ago and it failed on something else entirely, but I’m happy to try it again once the pull request is merged.
@Mingwei I used the nightly build, and from what I can tell it includes your fix.
I ran it several times and my tests pass. I will start using this version at larger scale next week and will be able to see it in more use cases.
I have noticed that, since it is a timing issue, changing one thing sometimes makes the deadlock less frequent, so only time will tell whether it has solved the entire problem.
Thank you very much