Hi all!
I have a fairly hefty dataset (at least for me and my machine): about 10 GB zipped. I'm trying to use Ray to run hyperparameter tuning for the model I'm training on this dataset, but spinning up the cluster with this dataset is extremely slow, presumably because the data gets copied to the workers. That's obvious in retrospect, but I'm not sure where to go from here.

Ideally, I'd like to point the cluster at the files on my disk and say, "There you go, they exist, don't make copies," since the files are read-only once they've been created. With plain Docker I'd reach for bind mounts, but I'm not sure how to approach this in a distributed context like Ray.

Given my short research timeline, I'll probably fall back to a randomized grid search, but for my own edification (and for whoever comes after me): what's the best way to handle this kind of situation? Copying the files into the cluster clearly isn't efficient, and I'd rather not rewrite the data loader from its current file-based loading into some kind of weird network-based loader. If that really is the best approach, though, what's the easiest way to keep the semantics as close as possible to what I have now?
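For context, here's a minimal sketch of roughly what my trial setup looks like today, using the Ray 2.x `tune.Tuner` API. Names like `DATA_DIR` and `load_shards` are placeholders for my actual paths and loader; the key point is that each trial just reads the (read-only) files straight off local disk:

```python
import os
from pathlib import Path

from ray import tune

# Placeholder: where the dataset lives on disk. This is the path I'd love
# every worker to see directly (e.g. via a bind mount or shared filesystem)
# instead of having the files copied into the cluster.
DATA_DIR = Path(os.environ.get("DATA_DIR", "/data/my_dataset"))


def load_shards(data_dir: Path):
    # Placeholder for my real file-based loader: it simply reads the
    # read-only files from disk, no copying, no network protocol.
    return sorted(data_dir.glob("*.zip"))


def train_model(config):
    shards = load_shards(DATA_DIR)
    # ... build the model from `config` and train it on `shards` ...
    # Function trainables can return a dict of final metrics.
    return {"loss": 0.0}  # placeholder metric


tuner = tune.Tuner(
    train_model,
    param_space={
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    },
    # Randomly sampling 20 configurations from the space above,
    # i.e. the randomized search I expect to fall back to.
    tune_config=tune.TuneConfig(num_samples=20),
)
results = tuner.fit()
```

Ideally I'd keep `load_shards` exactly as it is and only change how the data gets exposed to the workers.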
Thanks for any suggestions or advice!