Accessing Large Static Datasets with Ray Clusters

Hi all!

I have a pretty hefty dataset (at least for me and my machine), about 10 GB when zipped, and I am attempting to use Ray to perform hyperparameter tuning on the model I’m training on it. However, starting the clusters with this dataset is extremely slow. This is obvious in retrospect, but I’m unsure where to go from here. Ideally, I would love to point the clusters at the files on my disk and say, “There you go, they exist, don’t make copies,” since the files are read-only after they are initially created. I would use bind mounts with Docker, but I’m not sure how to approach this in a distributed context like Ray.

I will likely have to use a randomized grid search for my research since the timeline is very short, but for my edification and for those who come after me: what is the best way to handle this kind of situation? Copying the files into the clusters clearly isn’t efficient, and I don’t particularly want to replace the current file-based data loader with some kind of network-based loader. But if that’s the best way to do it, what’s the easiest way to keep the semantics as similar as possible?

Thanks for any suggestions or advice!

Hey @ProfDoof, can you share your cluster settings? For example, how many nodes do you have, where is the dataset stored, and are you using a shared file system across the workers? Also, can you share the code of your data loading logic, e.g., are you using a PyTorch DataLoader?

One solution I can think of is to use Ray Data. It stores your data in the shared object store, so you don’t have to copy your data to every node in your cluster.

For example, see Training a Torch Image Classifier — Ray 2.4.0:

You can convert your PyTorch dataset to a Ray Dataset with ray.data.from_torch():

import torchvision
import ray

# Create the PyTorch dataset as usual, then wrap it in a Ray Dataset.
train_dataset = torchvision.datasets.CIFAR10("data", download=True, train=True)
train_dataset: ray.data.Dataset = ray.data.from_torch(train_dataset)

Then, inside your training function, use session.get_dataset_shard("train") to get the data on each worker; the dataset is sharded across workers automatically.
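A minimal sketch of what that per-worker loop could look like (assuming a TorchTrainer with a dataset registered under the key "train"; num_epochs and batch_size are placeholder config values, not details from your setup):

from ray.air import session

def train_loop_per_worker(config):
    # Each worker only receives its shard of the "train" dataset.
    train_shard = session.get_dataset_shard("train")
    for epoch in range(config["num_epochs"]):
        # iter_torch_batches yields batches as dicts of torch tensors.
        for batch in train_shard.iter_torch_batches(batch_size=config["batch_size"]):
            ...  # run your forward/backward pass on the batch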

Hi, I don’t understand the node system well enough to have gotten the nodes working properly, so I don’t have a setup in that case. Also, how can I set up a shared file system? I couldn’t find a way to do that.

  • You can use a shared file system and mount it on all the worker nodes.
  • You can use cloud storage like S3 or GCS and download your data onto each node.
  • If you don’t want to use a shared FS or cloud storage, you can simply use Ray Data (see the sketch below):
    • First, have your data on the head node and build a Ray Dataset from it.
    • Then pass the Ray Dataset to an AIR Trainer, and Ray will shard the data and send it to all workers.
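For that last option, a rough sketch of the wiring could look like this (the read_parquet call, the file path, and the num_workers value are assumptions for illustration; use whatever reader matches your data format):

import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

# Build the dataset once on the head node; Ray Data keeps blocks in the
# distributed object store instead of copying raw files to every node.
train_dataset = ray.data.read_parquet("/path/to/your/dataset")  # hypothetical path and format

def train_loop_per_worker(config):
    ...  # same per-worker loop as in the earlier sketch

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    datasets={"train": train_dataset},  # key matches get_dataset_shard("train")
    scaling_config=ScalingConfig(num_workers=4),  # placeholder worker count
)
result = trainer.fit()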