Broken Pipe Error

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.


I have been using Ray Core to distribute Google AI Platform prediction API requests for a large dataset. The daily job was running fine on Google Kubernetes Engine until recently.
And without any change to the script, the job started crashing every day with the following error:

ray.exceptions.RayTaskError(BrokenPipeError): ray::get_predictions_core() (pid=270, ip=

It seems like the distribution and predictions complete correctly up until the last chunk, but the job crashes right after.
If I take a subset of the input data, it runs fine, so I am guessing it may be a memory or quota issue?

Any help on this would be greatly appreciated. Thank you!

Below is an excerpt of my code.

import ray
from tqdm import tqdm

@ray.remote
def get_predictions_core(service, name, chunk):
    # ... call the prediction API on this chunk (body elided in the post) ...
    return predictions

responses = ray.get([get_predictions_core.remote(service, name, chunk)
                     for chunk in tqdm(chunks, desc='* getting predictions in parallel chunks:')])
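Since the crash only appears on the full dataset, one thing worth trying is submitting the tasks in bounded batches rather than one `ray.get` over every chunk at once, so the driver never holds all in-flight results simultaneously. Below is a minimal sketch; `batched()` is a hypothetical helper (not from the post), and the Ray usage in the trailing comment assumes `get_predictions_core` is decorated with `@ray.remote` as in the excerpt above.

```python
# Sketch: split the work into fixed-size windows so only one window of
# task results is materialized on the driver at a time.

def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical usage with the Ray task from the excerpt above:
#
# responses = []
# for batch in batched(chunks, 50):  # 50 is an arbitrary batch size
#     refs = [get_predictions_core.remote(service, name, c) for c in batch]
#     responses.extend(ray.get(refs))  # at most one batch of results in flight
```

This does not fix a genuine per-task OOM, but it can rule out (or confirm) driver-side memory pressure as the cause.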

This indeed sounds like some of the processes running the get_predictions_core() tasks are being killed, possibly by an out-of-memory kill inside the GKE container. Do you have visibility into process deaths in GKE?

We can dig further by going into the pod and checking the contents of /tmp/ray/session_latest; it may contain log files for process 270 (python-core-worker-xxx_270.log). raylet.out may also contain useful information.
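To narrow the search once inside the pod, a small helper like the one below can pick out the worker log files belonging to a specific PID. This is only a sketch: `worker_logs_for_pid` is a hypothetical name, and the `_<pid>.log` suffix is an assumption based on the `python-core-worker-xxx_270.log` example mentioned in this thread.

```python
# Sketch: filter a directory listing down to the log files for one worker PID.
import re

def worker_logs_for_pid(filenames, pid):
    """Return filenames whose name ends with _<pid>.log."""
    pattern = re.compile(rf"_{pid}\.log$")
    return [f for f in filenames if pattern.search(f)]

# Hypothetical usage inside the pod (path from the reply above; the exact
# subdirectory layout under session_latest may differ by Ray version):
#
# import os
# worker_logs_for_pid(os.listdir("/tmp/ray/session_latest/logs"), 270)
```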

Otherwise, we are working on adding reports of Ray node and task deaths to the driver output.

Hi Mingwei,
Thanks a lot for your reply!
I am not sure how to dig further in the pod, but I will enquire.
Thanks again