How do I "resume" a dataset?

Vedant_Roy · October 20, 2022, 5:59am

With Ray Datasets, how do I skip over, e.g., the first 500K batches in a dataset?
The use case is a run on spot instances that stops because the spot instance shuts down.

Ideally, I’d store the number of training steps (let’s call it N) that have occurred so far, and then skip over N batches.

jianxiao · October 21, 2022, 12:28am

@Vedant_Roy Datasets doesn’t have this feature right now. However, we are considering this for supporting fault-tolerance.

Just to understand your use case better for this work:

Are you running the training job on spot instances, and do you want to restart it and resume from the dataset position prior to the job's death?
Do you also store how many epochs the job has consumed, in addition to the number of batches?
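
For illustration, here is a minimal sketch of the "store N, then skip N batches" idea from the question, in plain Python. It is not a Ray Data feature: `load_progress`, `save_progress`, `resume_batches`, and the JSON progress file are hypothetical names made up for this example, and the batch source is assumed to be any iterable of batches (for instance, the iterator returned by `ds.iter_batches()`).

```python
import itertools
import json
import os
from typing import Iterable, Iterator, Tuple


def load_progress(path: str) -> int:
    """Return the number of batches already consumed, or 0 if nothing was saved."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(json.load(f)["batches_done"])


def save_progress(path: str, batches_done: int) -> None:
    """Record the batch count; write to a temp file first so a crash cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"batches_done": batches_done}, f)
    os.replace(tmp, path)


def resume_batches(batches: Iterable, skip: int) -> Iterator[Tuple[int, object]]:
    """Yield (step, batch) pairs, discarding the first `skip` batches.

    The skipped batches are still produced by the source and thrown away,
    so this saves training compute but not data-loading time.
    """
    remaining = itertools.islice(batches, skip, None)
    return enumerate(remaining, start=skip)
```

In a training loop this could be used as `for step, batch in resume_batches(batch_iter, load_progress(path)):`, calling `save_progress(path, step + 1)` after each step or every few steps. The resumed position only matches the original run if the batch order is deterministic across restarts (for example, no shuffling or a fixed shuffle seed).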

MoFHeka · August 20, 2024, 9:20pm

@jianxiao Any progress?

rliaw · August 28, 2024, 6:10pm

Hey @MoFHeka - we’re currently not planning to support this in Ray Data. Can you share a bit about your use case?

rliaw · August 28, 2024, 6:11pm

If there is interest, I would recommend opening a GitHub issue to track this as a feature request.
