How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi
I have been using Ray Core to distribute Google AI Platform prediction API requests for a large dataset. The daily job had been running fine on Google Kubernetes Engine until recently, when, without any change to the script, it started crashing every day with the following error:
ray.exceptions.RayTaskError(BrokenPipeError): ray::get_predictions_core() (pid=270, ip=10.0.2.42)
It seems like the chunks are distributed and the predictions complete correctly up to the last chunk, but the job crashes right after.
If I run the job on a subset of the input data, it completes fine, so I am guessing it may be a memory or quota issue?
Any help on this would be greatly appreciated. Thank you!
Below is an excerpt of my code.
```python
import ray
from tqdm import tqdm

@ray.remote
def get_predictions_core(service, name, chunk):
    # Calls the AI Platform prediction API for one chunk of the dataset.
    [.....]
    return predictions

ray.shutdown()
ray.init(dashboard_host='0.0.0.0')
# Submit one task per chunk and fetch all results at once.
responses = ray.get([get_predictions_core.remote(service, name, chunk)
                     for chunk in tqdm(chunks, desc='* getting predictions in parallel chunks:')])
ray.shutdown()
```
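
In case it is relevant, one workaround I am considering is capping the number of in-flight tasks with `ray.wait` so results are fetched incrementally instead of all at once at the end. This is only a rough sketch reusing `get_predictions_core`, `service`, `name`, and `chunks` from the excerpt above, and `MAX_IN_FLIGHT` is just a placeholder value I picked:

```python
import ray
from tqdm import tqdm

# Placeholder cap on how many prediction tasks are in flight at once.
MAX_IN_FLIGHT = 32

ray.init(dashboard_host='0.0.0.0')

in_flight, responses = [], []
for chunk in tqdm(chunks, desc='* getting predictions in parallel chunks:'):
    # Once the cap is reached, block until one task finishes and fetch
    # its result, so results do not all pile up in the object store.
    if len(in_flight) >= MAX_IN_FLIGHT:
        done, in_flight = ray.wait(in_flight, num_returns=1)
        responses.extend(ray.get(done))
    in_flight.append(get_predictions_core.remote(service, name, chunk))

# Fetch whatever is still outstanding.
responses.extend(ray.get(in_flight))
ray.shutdown()
```

Would something like this be a reasonable way to reduce memory pressure, or is the BrokenPipeError likely unrelated?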