Tune: Ray logs for a failed Tune trial

While running ray[tune] for a custom optimization objective, I got this error for one of the trials:

Failure # 1 (occurred at 2023-03-22_22-00-06)
Traceback (most recent call last):
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: ImplicitFunc
	actor_id: 67f2dd457359b14fc06cdf9101000000
	pid: 31132
	namespace: 6c94a804-fe49-4945-a548-d00bf7d3dac5
	ip: 172.31.0.86
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Looking in the trial’s directory on the head node, I can see that result.json and progress.csv contain some identification data relevant to the Tune optimization process, such as trial_id and experiment_id. However, there is little information about where and how the code was run. There is a hostname field and a node_ip field, and the file name of the TensorFlow data also contains an IP address (which, by the way, is not the same as node_ip) and what I expect is some other kind of ID.
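For reference, this is roughly how I’m pulling those fields out of result.json (the trial directory path below is made up; result.json is newline-delimited JSON, one object per reported result):

import json
from pathlib import Path

# Hypothetical path to the trial directory; adjust to your own experiment.
trial_dir = Path("~/ray_results/my_experiment/objective_abc12_00000").expanduser()

# result.json is newline-delimited JSON: one object per reported result.
with open(trial_dir / "result.json") as f:
    last_result = json.loads(f.readlines()[-1])

# Identification fields that show up in each result.
for key in ("trial_id", "experiment_id", "hostname", "node_ip", "pid"):
    print(key, "=", last_result.get(key))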

Anyway, the point is that I wanted to search the Ray Dashboard for logs that might explain why the worker crashed (e.g., OOM), but I can’t seem to make the connection between the trial and experiment IDs and the identification info needed to find the logs in the dashboard, i.e., things like Job ID, Actor ID, Task ID, Worker ID, Raylet ID, Node ID, PID, etc.

So how does one find the ray logs for a given ray[tune] trial?

Is there anything else I can do to try getting some explanation for the crash above?

Also, it may be relevant that I’m currently using ray[tune] via flaml, which uses an older, deprecated API, even though I’m on Ray 2.3.0.

The Tune trial was running on the Ray actor that raised the exception (actor ID 67f2dd457359b14fc06cdf9101000000). You should be able to find the associated logs on the node with IP 172.31.0.86.

Often, the actual exception that caused the actor to die is higher up in the stack trace (scroll up).

The Tune trial metadata should contain the IP and PID.
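With those, something like this rough sketch should turn up the relevant files on that node, assuming the default /tmp/ray log location (file-name conventions can vary between Ray versions):

from pathlib import Path

pid = 31132  # PID from the trial metadata / the error above
log_dir = Path("/tmp/ray/session_latest/logs")

# Worker stdout/stderr files normally embed the worker PID in the file name,
# e.g. worker-<worker_id>-<job_id>-<pid>.out / .err
for path in sorted(log_dir.glob(f"*{pid}*")):
    print(path)

# raylet.out on the same node is also worth checking for kill messages.
print(log_dir / "raylet.out")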

Sorry, I missed that. I don’t understand how I managed to; it was right there in front of me.

Hmm… actually, I couldn’t find the log by the actor ID. Are the log file names (in the Logs section) actor IDs?

(I couldn’t find it in the “Actors” section of the Ray Dashboard either)

Are you looking on the node the actor was running on?

I thought I was, yes. I’ve since killed the dashboard, but I did try to save the logs from /tmp/ray. Maybe the process was killed because of OOM before it got to write anything to the logs?

In the meantime, I ran the optimization again, and now I get clearer messages saying that the worker was killed because it used too much memory.
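For the record, this is roughly how I double-checked the OOM kill on the worker node; it just scans the kernel log for OOM-killer messages mentioning the worker PID (Linux only, and dmesg may need sudo on some systems):

import subprocess

pid = 31132  # PID of the dead worker, from the trial metadata

# The kernel OOM killer logs lines like "Out of memory: Killed process <pid> ...".
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if "out of memory" in line.lower() or str(pid) in line:
        print(line)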