Hello,
We are considering using ray.data for some data preprocessing pipelines. The pipelines are simple: ingest Parquet, apply transforms, write Parquet, but they work on a lot of data (roughly the shape sketched below). This means they will run for a while, and they will run on a cluster where the head/driver node might die during the long processing time.
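For context, a pipeline looks roughly like this minimal sketch; the paths and the transform are placeholders, not our real code:

```python
import ray

# Placeholder transform; the real pipelines apply several steps like this.
def add_derived_columns(batch):
    batch["value_doubled"] = batch["value"] * 2
    return batch

ray.init()

# Illustrative paths; the real inputs/outputs live in object storage.
ds = ray.data.read_parquet("s3://our-bucket/raw/")
ds = ds.map_batches(add_derived_columns, batch_format="pandas")
ds.write_parquet("s3://our-bucket/processed/")
```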
The way we deal with this kind of failure today is to record processing state as we write Parquet fragments, so that when we restart the pipeline we can skip ahead and avoid reprocessing what was already written (see the sketch below).
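Concretely, the restart logic looks something like the following sketch. It assumes a per-partition layout where an existing, non-empty output directory means "already done"; the partition list, paths, and transform are illustrative stand-ins for our real code:

```python
import ray
import pyarrow.fs as pafs

INPUT_ROOT = "s3://our-bucket/raw"         # illustrative paths
OUTPUT_ROOT = "s3://our-bucket/processed"
PARTITIONS = ["dt=2024-01-01", "dt=2024-01-02"]  # illustrative partition keys

def transform(batch):
    # Placeholder for the real transforms.
    batch["value_doubled"] = batch["value"] * 2
    return batch

def is_done(output_uri: str) -> bool:
    # A partition counts as done if its output location already contains files.
    fs, path = pafs.FileSystem.from_uri(output_uri)
    infos = fs.get_file_info(pafs.FileSelector(path, allow_not_found=True))
    return len(infos) > 0

ray.init()
for part in PARTITIONS:
    out_uri = f"{OUTPUT_ROOT}/{part}"
    if is_done(out_uri):
        continue  # written by a previous run; skip it after a restart
    ds = ray.data.read_parquet(f"{INPUT_ROOT}/{part}")
    ds = ds.map_batches(transform, batch_format="pandas")
    ds.write_parquet(out_uri)
```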
I’ve been going through the Ray and ray.data docs to understand fault tolerance. They cover replaying tasks when a worker node fails or objects in distributed memory are lost, but it’s not clear what happens if the head/driver node dies and we restart the pipeline. Is ray.data able to checkpoint and recover from where it stopped, or is this something we would need to build on top of ray.data?
Thank you