Hello,
We are considering using ray.data for some data preprocessing pipelines. The pipelines are simple: ingest Parquet, apply transforms, write Parquet, but they work on a lot of data (roughly the shape sketched below). This means they will run for a while, and they will run on a cluster where the head/driver node might die during the long processing time.
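For context, a pipeline looks roughly like this minimal sketch; the paths and the transform are placeholders, not our real code:

```python
import ray

# Placeholder transform; the real pipelines apply several steps like this.
def add_derived_columns(batch):
    batch["value_doubled"] = batch["value"] * 2
    return batch

ray.init()

# Illustrative paths; the real inputs/outputs live in object storage.
ds = ray.data.read_parquet("s3://our-bucket/raw/")
ds = ds.map_batches(add_derived_columns, batch_format="pandas")
ds.write_parquet("s3://our-bucket/processed/")
```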
The way we deal with this kind of failure today is to record processing state as we write Parquet fragments, so that when we restart the pipeline we can skip ahead and avoid reprocessing what was already written (see the sketch below).
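Concretely, the restart logic looks something like the following sketch. It assumes a per-partition layout where an existing, non-empty output directory means "already done"; the partition list, paths, and transform are illustrative stand-ins for our real code:

```python
import ray
import pyarrow.fs as pafs

INPUT_ROOT = "s3://our-bucket/raw"         # illustrative paths
OUTPUT_ROOT = "s3://our-bucket/processed"
PARTITIONS = ["dt=2024-01-01", "dt=2024-01-02"]  # illustrative partition keys

def transform(batch):
    # Placeholder for the real transforms.
    batch["value_doubled"] = batch["value"] * 2
    return batch

def is_done(output_uri: str) -> bool:
    # A partition counts as done if its output location already contains files.
    fs, path = pafs.FileSystem.from_uri(output_uri)
    infos = fs.get_file_info(pafs.FileSelector(path, allow_not_found=True))
    return len(infos) > 0

ray.init()
for part in PARTITIONS:
    out_uri = f"{OUTPUT_ROOT}/{part}"
    if is_done(out_uri):
        continue  # written by a previous run; skip it after a restart
    ds = ray.data.read_parquet(f"{INPUT_ROOT}/{part}")
    ds = ds.map_batches(transform, batch_format="pandas")
    ds.write_parquet(out_uri)
```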
I’ve been going through the Ray and ray.data docs to understand fault tolerance. They cover replaying tasks when a worker node fails or objects in distributed memory are lost, but it’s not clear what happens if the head/driver node dies and we restart the pipeline. Is ray.data able to checkpoint and recover from where it stopped, or is this something we would need to build on top of ray.data?
Thank you