How to get results back from a job (production scenarios)?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Apologies if this has been answered before. I am new to Ray, but I have searched through this forum and GitHub issues, and I haven’t seen an idiomatic answer for the basic workflow of submitting work and getting results back from the Ray cluster in a production scenario.

  • I want to perform work on a Ray cluster
  • The work will be issued by an existing service in my backend infrastructure
  • The work will produce results
  • I need to get the results back into the service that issued the work

At first, the “Ray Client” seemed like the obvious answer: a simple Python decorator with a blocking call to obtain task results. However, I keep seeing in multiple places that the Ray Client is not recommended for production scenarios, because if there is a network interruption the connection to the submitted task is lost, and there is no way to reconstruct that connection. (Is this correct?) It isn’t only network disconnection; it’s any interruption, such as a redeployment or an OOM event.

Thus, I keep seeing that the recommended way is to submit work using the Job Submission API. This is fine and works well for my scenario, but I can’t seem to find what is considered the idiomatic way to retrieve job results back into my original service.

I understand that I can use a cloud object store like S3 or GCS as an intermediate storage layer. Even here, there is some subtlety, because there isn’t an obvious way to communicate the name/path of the saved resource from the job back to my service. So I assume the job submitter has to choose the resource name/path and provide it as part of the job spec; the job code then saves the results to that location. The submitter code can poll the job status by job id and, upon success, fetch the saved results from storage.
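A minimal sketch of the pattern described above, in case it helps make the question concrete. It assumes the Ray Job Submission SDK (`JobSubmissionClient.submit_job` and `get_job_status`); the script name `my_job.py`, its `--output` flag, and the helper names are hypothetical, and fetching the result from S3/GCS is left to whatever storage client the service already uses.

```python
import time
import uuid

# Job states after which polling can stop (names as reported by Ray's JobStatus).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def new_output_uri(prefix):
    """The submitter picks the result location up front, so both sides know it."""
    return f"{prefix.rstrip('/')}/{uuid.uuid4().hex}/result.json"


def submit_job_with_output(client, output_uri):
    """Hand the chosen location to the job through its entrypoint command."""
    # `my_job.py` and `--output` are hypothetical; the job script is expected
    # to write its results to the given location before exiting.
    return client.submit_job(entrypoint=f"python my_job.py --output {output_uri}")


def wait_for_job(client, job_id, timeout_s=3600.0, poll_s=5.0, sleep=time.sleep):
    """Poll the job status until it reaches a terminal state or the timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(job_id)
        # Works for both an enum value (as the SDK returns) and a plain string.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {name} after {timeout_s}s")
        sleep(poll_s)


def run_on_cluster(address, bucket_prefix):
    """End-to-end flow; returns the URI the caller should fetch on success."""
    # Imported lazily so the helpers above can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    output_uri = new_output_uri(bucket_prefix)
    job_id = submit_job_with_output(client, output_uri)
    final = wait_for_job(client, job_id)
    if final != "SUCCEEDED":
        raise RuntimeError(f"job {job_id} ended as {final}")
    return output_uri
```

Because the submitter generates the URI, nothing has to be communicated back from the job except its status; the calling service already knows where to look.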

I’ve also seen references to the Ray object store as a possible means of transferring the state, but it never seems to be presented as the idiomatic, preferred pattern, merely as “possible”.

What is the preferred idiomatic workflow for this use-case, or is there something else that should be preferred?

This was also asked back in July 2023 in “How do I get data/files out from a ray job?”, which is currently unanswered.

OK, I got an answer to this question on the Ray Slack. I don’t want to name who said this because I don’t have permission (I didn’t ask), but basically:

“You should think of a Ray cluster as basically flammable. In production scenarios anything that uses Ray should be wrapped in external retries and durable external stores.”

So with that I’m going to proceed with saving result state to cloud storage and I’ll figure out some signalling mechanism to let my calling code know that the job is finished. Probably polling job status but it could be something else.
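For anyone landing here later, this is roughly how I read the “external retries plus durable store” advice. It is only a sketch under my own assumptions: `attempt` stands for whatever submits the job and waits for it, `result_exists` stands for a check against the cloud storage location, and both names are hypothetical. The durable store, not the cluster, is treated as the source of truth for whether the work is done.

```python
import time


def run_with_retries(attempt, result_exists, max_attempts=3,
                     base_delay_s=1.0, sleep=time.sleep):
    """Retry `attempt` until the durable store holds a result.

    attempt: callable that submits the job and waits for it; may raise.
    result_exists: callable returning True once the result is in durable storage.
    Returns the number of attempts that were actually made.
    """
    # A previous run (or a retry we lost track of) may already have finished.
    if result_exists():
        return 0

    last_error = None
    for n in range(1, max_attempts + 1):
        try:
            attempt()
        except Exception as exc:  # cluster gone, network drop, job failure, ...
            last_error = exc
        # Trust the durable store rather than the return value of the attempt.
        if result_exists():
            return n
        if n < max_attempts:
            sleep(base_delay_s * 2 ** (n - 1))  # exponential backoff

    raise RuntimeError(f"no result after {max_attempts} attempts") from last_error
```

Checking the store before and after each attempt keeps the wrapper safe to re-run after a redeployment of the calling service, provided the job writes its result to a location the caller chose in advance.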