While running ray[tune] for a custom optimization objective, I got this error for one of the trials:
Failure # 1 (occurred at 2023-03-22_22-00-06)
Traceback (most recent call last):
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
future_result = ray.get(ready_future)
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 67f2dd457359b14fc06cdf9101000000
pid: 31132
namespace: 6c94a804-fe49-4945-a548-d00bf7d3dac5
ip: 172.31.0.86
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
By looking in the trial's directory on the head node, I can see that the files result.json and progress.csv contain some identification data relevant to the tune optimization process, such as trial_id and experiment_id. However, there is little information about where and how the code was run. There is a hostname field and a node_ip field, and the filename of the TensorFlow event data also contains an IP address (which, by the way, is not the same as node_ip) and what I expect to be some other sort of ID.
Anyway, the point is that I wanted to search the Ray Dashboard for logs that might explain why the worker crashed (e.g., OOM), but I can't seem to make the connection between the trial and experiment IDs and the identification info the dashboard needs to locate logs, i.e., things like Job ID, Actor ID, Task ID, Worker ID, Raylet ID, Node ID, PID, etc.
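The only workaround I could think of so far is to report those identifiers myself from inside the trainable, so they land in result.json next to trial_id. A minimal sketch, assuming a function trainable I can modify and that ray.get_runtime_context() exposes these getters in my Ray version:

import os
import ray
from ray.air import session

def objective(config):
    ctx = ray.get_runtime_context()
    ids = {
        # hex strings that should match what the dashboard shows
        # (assumption: these getters exist in this Ray version)
        "job_id": ctx.get_job_id(),
        "node_id": ctx.get_node_id(),
        "actor_id": ctx.get_actor_id(),  # the ImplicitFunc actor wrapping this trainable
        "pid": os.getpid(),
    }
    score = (config["x"] - 3) ** 2  # hypothetical objective
    session.report({"score": score, **ids})  # ends up in result.json / progress.csv

That only helps for future runs, though; it doesn't answer how to do the mapping after the fact, which is what I actually need here.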
So how does one find the Ray logs for a given ray[tune] trial?
Is there anything else I can do to get an explanation for the crash above?
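The closest I've gotten is grepping the per-node log directory for the pid from the error message (31132 above), since the worker log file names seem to include the worker's pid. A rough sketch, assuming the default /tmp/ray/session_latest/logs location and that it runs on the node the trial was scheduled on (172.31.0.86 here):

import glob
import os

PID = "31132"  # pid from the RayActorError message
log_dir = "/tmp/ray/session_latest/logs"  # default location; adjust if ray was started with a custom temp dir

# worker stdout/stderr and core-worker logs appear to embed the pid in the filename
for path in sorted(glob.glob(os.path.join(log_dir, "*"))):
    if PID in os.path.basename(path):
        print("candidate log file:", path)

# raylet logs sometimes record why a worker died (e.g. killed under memory pressure)
for name in ("raylet.out", "raylet.err"):
    path = os.path.join(log_dir, name)
    if os.path.isfile(path):
        with open(path, errors="ignore") as f:
            for line in f:
                if PID in line:
                    print(f"{name}: {line.rstrip()}")

But this feels like guesswork, so I'd still like to know what the intended trial-to-logs mapping is.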
Also, maybe it's relevant that I'm currently using ray[tune] via flaml, which uses an older, deprecated API, even though I'm on version 2.3.0.