Error after queue initialisation

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I got the following error after initialising the queue:

  File "/tmp/ray/session_2023-03-11_20-32-22_575180_88/runtime_resources/working_dir_files/_ray_pkg_b7b314c25cc80b56/tools/detection_actor.py", line 82, in start_job
    await self.det_holders[job_id].put_async(DetectionObj(dets, job_id))
  File "/usr/local/lib/python3.8/dist-packages/ray/util/queue.py", line 132, in put_async
    await self.actor.put.remote(item, timeout)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: _QueueActor
	actor_id: 7ebe8644ea290e8661bbe7fd07000000
	pid: 1995
	name: 440_dh
	namespace: raypipe
	ip: 172.16.30.130
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.

Here, I first initialise a queue named “det_holder” and pass it to two different actors: a detection actor puts detection objects into the queue, and a tracking actor reads those objects from it.
After the queue was initialised, the detection actor threw the above error on the very first frame, when it tried to put an object. This suggests Ray could not initialise the queue properly.
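
For context, here is a minimal sketch of the setup described above (the actor class names, the payload in place of DetectionObj, and the job-id plumbing are simplified stand-ins for my actual code):

    import ray
    from ray.util.queue import Queue

    @ray.remote
    class DetectionActor:
        def __init__(self, det_holder: Queue):
            self.det_holder = det_holder

        async def start_job(self, dets, job_id):
            # This put_async call is where the RayActorError above was raised.
            await self.det_holder.put_async((dets, job_id))

    @ray.remote
    class TrackingActor:
        def __init__(self, det_holder: Queue):
            self.det_holder = det_holder

        async def consume(self):
            # Blocks until the detection actor has put an item.
            return await self.det_holder.get_async()

    ray.init()
    det_holder = Queue()  # backed by an internal _QueueActor, as seen in the traceback
    detector = DetectionActor.remote(det_holder)
    tracker = TrackingActor.remote(det_holder)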

I could not reproduce this issue. Can anyone suggest what the reason/cause behind this might be?

hey @shyampatel, unfortunately it’s hard to tell why it crashed without logs or a code snippet. Would it be possible to get the logs of the crashed worker? Logging — Ray 2.3.0 has a bit more context on where the worker logs reside.

Actually, we restarted the cluster after this happened. If you can guide me on how to fetch the logs of the crashed worker, I can forward them.

hi @shyampatel, the logs should be under /tmp/ray/session_$timestamp on the node where the actor crashed.
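
For example, something like this can help locate the relevant files (a rough sketch; exact filenames vary, but worker log files embed the worker PID, which was 1995 in the error above):

    import glob

    # Worker stdout/stderr and core-worker logs live under the session's
    # logs/ directory; filenames include the worker PID (1995 above).
    for path in sorted(glob.glob("/tmp/ray/session_*/logs/*1995*")):
        print(path)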

I was able to reproduce this issue. Can you please tell me which specific log file you require?