With Ray Datasets, how do I skip over, e.g., the first 500K batches in a dataset?
The use case: when I'm training on spot instances, the run can stop because a spot instance shuts down.
Ideally, I'd store the number of training steps completed so far (call it N), and then skip over the first N batches when the run restarts.
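In the absence of a built-in skip, one workaround is to fast-forward the batch iterator yourself. A minimal sketch, assuming you have a batch iterator such as the one returned by `Dataset.iter_batches()` (here a plain Python iterator stands in for it); note this still reads and discards the skipped batches, so it restores the position but not the I/O cost:

```python
from itertools import islice

def skip_batches(batch_iter, n_skip):
    """Fast-forward an iterator past the first n_skip batches.

    This consumes and discards the skipped batches, so it is O(n_skip)
    in read cost; it only restores the training *position*.
    """
    return islice(batch_iter, n_skip, None)

# Stand-in for ds.iter_batches(); with Ray Datasets you would pass
# the real batch iterator instead.
batches = iter(range(10))
remaining = list(skip_batches(batches, 4))
# remaining == [4, 5, 6, 7, 8, 9]
```

On restart you would persist N (e.g., in your checkpoint) and call `skip_batches(ds.iter_batches(...), N)` before resuming the training loop.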
@Vedant_Roy Datasets doesn't have this feature right now. However, we are considering it as part of fault-tolerance support.
Just to understand your use case better for this work:
- Are you running the training job on spot instances, and do you want to restart the job and resume from the dataset position it had reached before the job died?
- Do you also store how many epochs the job has consumed, in addition to the number of batches?