Redirect worker logs to the driver

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

When a task/actor is executed, the stdout/stderr is redirected to the driver as well as to worker logs. Would it be possible to directly also forward those worker logs to the driver, for example in the driver’s /tmp directory? This would allow easy access to the per-worker logs that aren’t tangled into a single pipe.

Currently, I have a context manager that pipes all stdout/stderr of my driver and parses the output into files based on PID (roughly sketched below). While this works, it has some limitations: the blocking .get() call must happen within the context manager. It would be great if the above were possible so I could avoid this workaround.
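A rough sketch of that kind of workaround, for illustration only; the regex, class name, and paths below are my own placeholders, and it assumes the “(name pid=NNN)” prefix Ray adds to forwarded lines:

import os
import re
import sys
from contextlib import contextmanager

# Illustrative only: split the driver's combined stdout/stderr into
# per-PID files by matching Ray's "(task_name pid=NNN)" prefix.
# Assumes whole lines arrive per write() call.
_PID_PREFIX = re.compile(r"\(.*?pid=(\d+).*?\)")

class _PidSplitter:
    def __init__(self, stream, log_dir):
        self.stream = stream          # still forward everything to the original stream
        self.log_dir = log_dir
        self._files = {}              # pid -> open file handle

    def write(self, text):
        self.stream.write(text)
        for line in text.splitlines(True):
            match = _PID_PREFIX.search(line)
            if match:
                pid = match.group(1)
                if pid not in self._files:
                    path = os.path.join(self.log_dir, f"worker_{pid}.log")
                    self._files[pid] = open(path, "a")
                self._files[pid].write(line)

    def flush(self):
        self.stream.flush()
        for f in self._files.values():
            f.flush()

@contextmanager
def split_worker_logs(log_dir="/tmp/ray_worker_logs"):
    os.makedirs(log_dir, exist_ok=True)
    out, err = _PidSplitter(sys.stdout, log_dir), _PidSplitter(sys.stderr, log_dir)
    sys.stdout, sys.stderr = out, err
    try:
        yield
    finally:
        sys.stdout, sys.stderr = out.stream, err.stream
        for f in list(out._files.values()) + list(err._files.values()):
            f.close()

# Usage: everything, including the blocking ray.get(), has to happen
# inside the context manager -- which is exactly the limitation mentioned above:
#
#     with split_worker_logs():
#         ray.get([my_task.remote() for _ in range(20)])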

Not sure if I understand the question

When a task/actor is executed, the stdout/stderr is redirected to the driver as well as to worker logs. Would it be possible to directly also forward those worker logs to the driver, for example in the driver’s /tmp directory?

What’s the difference between what Ray does now and “directly also forward those worker logs to the driver”?

Is it about the prefixed PID, etc.?

Hi, thank you for the quick response! The difference from the current behavior would be to have the logs in files directly, with different workers writing to different files.

I’m not sure if I really understand your question. Let me try to clarify a little bit.

Does it make sense? Can you rephrase your questions?

Yep, that makes sense. My understanding is that the current behavior is that worker logs are:

  • Written to log FILES on the worker nodes.
  • Redirected to stderr/stdout on the driver.

My question is: is there a straightforward way to have the log FILES that are written on the worker nodes also be written to the driver? I don’t just want the logs to be redirected to driver output; I want them written into files on the driver itself. Each worker node’s logs would correspond to a different file on the driver.

have the log FILES that are written on the worker nodes also be written to the driver

You mean writing the worker log files to the head node? The driver is just the entrypoint script.

Each worker node’s logs would correspond to a different file on the driver.

Each worker node may have a lot of worker log files. Do you just mean sending those log files to the head node? Or do you want to merge the worker log files into one log file first and then send that to the head node?

This discussion may benefit from me describing the root issue, rather than focusing on the specific technical details.

The general issue I’m running into is the following: on my AWS instance, I connect to a Ray server and send ~20 tasks to be run remotely on the server using ray.remote. I would like to see the outputs from those tasks, but I currently run into the following blockers:

  • The stderr/stdout forwarded from the workers to my instance are jumbled together and difficult to parse.
  • There is no easy way to know which worker’s log file I should check to find the outputs.
  • These log files don’t persist after the task is done, as far as I can tell.

I would like to easily find the logs associated with each of my individual tasks, and for those log files to persist when the task completes.
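For reference, the closest thing I have today is manually poking around the per-node session log directory, roughly like the sketch below. The /tmp/ray/session_latest/logs path and the worker-*.out naming are assumptions based on the default setup, and the files only exist on the node that ran the worker:

import glob
import os

# Assumed default location of per-worker stdout files on a node;
# adjust if the session directory is configured differently.
LOG_DIR = "/tmp/ray/session_latest/logs"

def list_worker_stdout_files(log_dir=LOG_DIR):
    # Newest first, so recently active workers show up at the top.
    paths = glob.glob(os.path.join(log_dir, "worker-*.out"))
    return sorted(paths, key=os.path.getmtime, reverse=True)

for path in list_worker_stdout_files():
    print(path)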

Do you have any suggestions?

cc @sangcho to follow up, since he is working on improvements to Ray logging.

Hi! Circling back on this. Any news?

Sorry, just saw the thread.

Btw, one clarification: we don’t technically redirect logs to the driver. What we do is read the worker log files and stream them to the driver over RPC, and then the driver just prints them again.
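So the driver-side printing is just a convenience layer on top of the worker log files. If that re-printing is what gets in your way, it can also be turned off when you connect; a minimal sketch (this only silences the driver-side copy, the per-worker files on each node are unaffected):

import ray

# Keep worker stdout/stderr in the per-worker log files only,
# without re-printing them in the driver process.
ray.init(log_to_driver=False)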

The stderr/stdout forwarded from the workers to my instance are jumbled together and difficult to parse.

Is it because the output is always formatted in a random way like (name, pid=XYZ)?

There is no easy way to know which worker’s log file I should check to find the outputs.

This is actually a problem we are planning to tackle in this version… For this one, I have a couple of questions.

  1. How do you plan to associate the worker logs to the job? via job_id?
  2. Do you use any logging tools & agents? For example, something like this Log Persistence — Ray 3.0.0.dev0
  3. Can you tell me a bit more about why just storing the driver output (which contains all worker logs technically) is not enough?

Thanks for getting back to me! To try and address your questions,

Is it because the output is always formatted in a random way like (name, pid=XYZ)?

Sort of? I want to use the logs to track progress of my tasks. It’s hard to track progress of 20 jobs when the logs are all funneled into a single stream. Although having clearer identification of where each log comes from would help, just that alone doesn’t solve the underlying issue.
To create a dummy example, say I had a function that printed 0 through 99, sleeping a random amount of time between prints. I launch 20 tasks that each run this function separately.

import random
import time

import ray

@ray.remote
def print100():
    # Print 0..99, sleeping a random amount between prints.
    for i in range(100):
        print(i)
        time.sleep(random.random())

ray.init()
refs = [print100.remote() for _ in range(20)]
ray.get(refs)  # block until all tasks finish

I’d like to track the progress of each task. If the stderr/stdout of each task were redirected to a separate file, this would be easy: I could just look at each file and see which numbers have been printed so far. However, when all of the prints are directed into a single stream, this is no longer easy to parse by eye.
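For illustration, the effect I’m after is roughly what I’d get by redirecting stdout inside the task body by hand, as in the sketch below. The directory and function names are just placeholders, and the files end up on whichever node ran the task rather than on the driver, which is the part I’d like Ray to handle:

import contextlib
import os
import random
import time

import ray

@ray.remote
def print100_to_file(task_index, log_dir="/tmp/my_task_logs"):
    # Illustrative: each task writes its own stdout to a separate file,
    # so progress can be followed per task instead of in one stream.
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"task_{task_index}.log")
    with open(path, "w", buffering=1) as f, contextlib.redirect_stdout(f):
        for i in range(100):
            print(i)
            time.sleep(random.random())

ray.init()
ray.get([print100_to_file.remote(i) for i in range(20)])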

For your latter series of questions:

  1. I don’t know. Currently, calling .remote() does not return anything that can be used to associate worker logs/files with it. It would be great if that were the case.
  2. I’m unfamiliar with the link that you sent, but can give it a look.
  3. I think that is addressed in my example above, but let me know if that’s not the case.

Thanks!

@felixy12 It seems that you just want to see the logs of each of your Ray tasks.
Have you tried looking at those logs via the Ray dashboard? You can click the log link in the dashboard.