Broken Pipe Error

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.


I have been using Ray Core to distribute Google AI Platform prediction API requests for a large dataset. The daily job was running fine on Google Kubernetes Engine until recently.
And without any change to the script, the job started crashing every day with the following error:

ray.exceptions.RayTaskError(BrokenPipeError): ray::get_predictions_core() (pid=270, ip=

It seems like the distribution and predictions complete correctly up until the last chunk, but the job crashes right after.
If I take a subset of the input data, it runs fine, so I am guessing it may be a memory or quota issue?

Any help on this would be greatly appreciated. Thank you!

Below is an excerpt of my code.

import ray
from tqdm import tqdm

@ray.remote
def get_predictions_core(service, name, chunk):
    # ... call the prediction API on this chunk (body elided in the post) ...
    return predictions

responses = ray.get([get_predictions_core.remote(service, name, chunk)
                     for chunk in tqdm(chunks, desc='* getting predictions in parallel chunks:')])
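Since the crash only appears on the full dataset, one thing worth trying is submitting the tasks in bounded batches rather than one `ray.get` over every chunk at once, so the driver never holds all in-flight results simultaneously. Below is a minimal sketch; `batched()` is a hypothetical helper (not from the post), and the Ray usage in the trailing comment assumes `get_predictions_core` is decorated with `@ray.remote` as in the excerpt above.

```python
# Sketch: split the work into fixed-size windows so only one window of
# task results is materialized on the driver at a time.

def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical usage with the Ray task from the excerpt above:
#
# responses = []
# for batch in batched(chunks, 50):  # 50 is an arbitrary batch size
#     refs = [get_predictions_core.remote(service, name, c) for c in batch]
#     responses.extend(ray.get(refs))  # at most one batch of results in flight
```

This does not fix a genuine per-task OOM, but it can rule out (or confirm) driver-side memory pressure as the cause.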

This indeed sounds like some of the processes running the get_predictions_core() tasks are being killed, possibly by an out-of-memory kill inside the GKE container. Do you have visibility into process deaths in GKE?

We can dig further by going into the pod and checking the contents of /tmp/ray/session_latest; it may contain log files for process 270 (python-core-worker-xxx_270.log). raylet.out may also contain useful information.
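To narrow the search once inside the pod, a small helper like the one below can pick out the worker log files belonging to a specific PID. This is only a sketch: `worker_logs_for_pid` is a hypothetical name, and the `_<pid>.log` suffix is an assumption based on the `python-core-worker-xxx_270.log` example mentioned in this thread.

```python
# Sketch: filter a directory listing down to the log files for one worker PID.
import re

def worker_logs_for_pid(filenames, pid):
    """Return filenames whose name ends with _<pid>.log."""
    pattern = re.compile(rf"_{pid}\.log$")
    return [f for f in filenames if pattern.search(f)]

# Hypothetical usage inside the pod (path from the reply above; the exact
# subdirectory layout under session_latest may differ by Ray version):
#
# import os
# worker_logs_for_pid(os.listdir("/tmp/ray/session_latest/logs"), 270)
```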

Otherwise, we are working on adding reports of Ray node and task deaths to the driver output.

Hi Mingwei,
Thanks a lot for your reply!
I am not sure how to dig further in the pod, but I will enquire.
Thanks again