I was using ray[tune] for a custom optimization objective, and I got this error for one of the trials:
```
Failure # 1 (occurred at 2023-03-22_22-00-06)
Traceback (most recent call last):
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: ImplicitFunc
    actor_id: 67f2dd457359b14fc06cdf9101000000
    pid: 31132
    namespace: 6c94a804-fe49-4945-a548-d00bf7d3dac5
    ip: 172.31.0.86
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
There are some potential root causes.
(1) The process is killed by SIGKILL by OOM killer due to high memory usage.
(2) `ray stop --force` is called.
(3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```
Looking in the trial's directory on the head node, I can see that progress.csv contains some identification data relevant to the Tune optimization process, such as experiment_id. However, there is very little information about where and how the code was run. There is a hostname field and a node_ip field, and the filename of the TensorFlow events file also contains an IP address (which, by the way, is not the same as node_ip) plus what I expect is some other kind of id.
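For reference, here is the small helper I've been using to pull those runtime fields out of a trial directory. It's a sketch under my assumptions: it reads result.json (which in my runs sits next to progress.csv and has one JSON record per line), and the field names are just what I observe in my own output.

```python
import json
import os

def trial_runtime_info(trial_dir: str) -> dict:
    """Extract runtime identification fields from a Tune trial directory.

    Assumes the directory contains a result.json with one JSON record per
    line; in my runs each record carries trial_id, experiment_id, hostname,
    node_ip and pid, which is what I need to match a trial to a node/worker.
    """
    with open(os.path.join(trial_dir, "result.json")) as f:
        first = json.loads(f.readline())  # the id fields repeat on every line
    keys = ("trial_id", "experiment_id", "hostname", "node_ip", "pid")
    return {k: first.get(k) for k in keys}

# Example (hypothetical path):
# print(trial_runtime_info("~/ray_results/my_experiment/trial_00000"))
```

The pid and node_ip this returns are at least enough to tell me which node the trial ran on.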
Anyway, the point is that I wanted to search the Ray Dashboard for logs that might explain why the worker crashed (e.g., OOM), but I can't seem to make the connection between the trial/experiment id and the identification info needed to find the logs in the dashboard, i.e., things like Job ID, Actor ID, Task ID, Worker ID, Raylet ID, Node ID, PID, etc.
So how does one find the ray logs for a given ray[tune] trial?
Is there anything else I can do to try getting some explanation for the crash above?
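In the meantime, I've been bypassing the dashboard and grepping the node's log directory by PID (31132 from the RayActorError above). A sketch, assuming the default log location /tmp/ray/session_latest/logs and that per-worker log filenames embed the PID, which is what I see on my nodes:

```python
import glob
import os

def find_worker_logs(log_dir: str, pid: int) -> list:
    """Return log files in log_dir whose filenames mention the given PID.

    On my nodes the per-worker logs embed the PID in the filename
    (e.g. python-core-worker-<worker_id>_<pid>.log, worker-...-<pid>.err),
    so a simple glob on the PID locates them.
    """
    return sorted(glob.glob(os.path.join(log_dir, f"*{pid}*")))

# Example (run on the node whose IP matched the dead actor):
# for path in find_worker_logs("/tmp/ray/session_latest/logs", 31132):
#     print(path)
```

For cause (1), checking dmesg / journalctl on that node for oom-killer messages also seems worthwhile.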
Also, maybe it's relevant that I'm currently using flaml, which is using an older, deprecated API, despite using version