Tune: Ray logs for a failed Tune trial

While running ray[tune] for a custom optimization objective, I got this error for one of the trials:

Failure # 1 (occurred at 2023-03-22_22-00-06)
Traceback (most recent call last):
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: ImplicitFunc
	actor_id: 67f2dd457359b14fc06cdf9101000000
	pid: 31132
	namespace: 6c94a804-fe49-4945-a548-d00bf7d3dac5
	ip: 172.31.0.86
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Looking in the trial’s directory on the head node, I can see that result.json and progress.csv contain some identification data relevant to the Tune optimization process, such as trial_id and experiment_id. However, there is little information about where and how the code was run. There is a hostname field and a node_ip field, and the file name of the TensorFlow data also contains an IP address (which, by the way, is not the same as node_ip) and what I expect is some other kind of ID.
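For reference, this is roughly how I’m pulling those fields out of result.json (the trial directory path below is made up; result.json is newline-delimited JSON, one object per reported result):

import json
from pathlib import Path

# Hypothetical path to the trial directory; adjust to your own experiment.
trial_dir = Path("~/ray_results/my_experiment/objective_abc12_00000").expanduser()

# result.json is newline-delimited JSON: one object per reported result.
with open(trial_dir / "result.json") as f:
    last_result = json.loads(f.readlines()[-1])

# Identification fields that show up in each result.
for key in ("trial_id", "experiment_id", "hostname", "node_ip", "pid"):
    print(key, "=", last_result.get(key))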

Anyway, the point is that I wanted to search the Ray Dashboard for logs that might explain why the worker crashed (e.g., OOM), but I can’t seem to make the connection between the trial and experiment IDs and the identification info needed to find the logs in the dashboard, i.e., things like Job ID, Actor ID, Task ID, Worker ID, Raylet ID, Node ID, PID, etc.

So how does one find the ray logs for a given ray[tune] trial?

Is there anything else I can do to try getting some explanation for the crash above?

Also, it may be relevant that I’m currently using ray[tune] via flaml, which uses an older, deprecated API, even though I’m on Ray 2.3.0.

The Tune trial was running on the Ray actor that raised the exception (actor ID 67f2dd457359b14fc06cdf9101000000). You should be able to find the associated logs on the node with IP 172.31.0.86.

Often, the actual exception that caused the actor to die is higher up in the stack trace (scroll up).

The Tune trial metadata should contain the IP and PID.
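With those, something like this rough sketch should turn up the relevant files on that node, assuming the default /tmp/ray log location (file-name conventions can vary between Ray versions):

from pathlib import Path

pid = 31132  # PID from the trial metadata / the error above
log_dir = Path("/tmp/ray/session_latest/logs")

# Worker stdout/stderr files normally embed the worker PID in the file name,
# e.g. worker-<worker_id>-<job_id>-<pid>.out / .err
for path in sorted(log_dir.glob(f"*{pid}*")):
    print(path)

# raylet.out on the same node is also worth checking for kill messages.
print(log_dir / "raylet.out")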

Sorry, I missed that. I don’t understand how I managed to; it was right there in front of me.

Hmm… actually, I couldn’t find the log by the actor ID. Are the log file names (in the Logs section) actor IDs?

(I couldn’t find it in the “Actors” section of the Ray Dashboard either)

Are you looking on the node the actor was running on?

I thought I was, yes. I’ve since killed the dashboard, but I did try to save the logs from /tmp/ray. Maybe the process was killed because of OOM before it got to write anything to the logs?

In the meantime, I ran the optimization again, and now I get clearer messages saying that the worker was killed because it used too much memory.
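For the record, this is roughly how I double-checked the OOM kill on the worker node; it just scans the kernel log for OOM-killer messages mentioning the worker PID (Linux only, and dmesg may need sudo on some systems):

import subprocess

pid = 31132  # PID of the dead worker, from the trial metadata

# The kernel OOM killer logs lines like "Out of memory: Killed process <pid> ...".
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if "out of memory" in line.lower() or str(pid) in line:
        print(line)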