Avoid moving datasets around the network when using tune.with_parameters

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I use ray.tune to tune the parameters of my models. I need to pass fairly large datasets (5-20 GB) to the trainable function, and I am currently doing so with tune.with_parameters.
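
A minimal sketch of that setup, assuming a function trainable and a hypothetical `load_my_dataset` loader (names are placeholders, not my actual code):

```python
from ray import tune

def trainable(config, train_data=None, valid_data=None):
    # train_data / valid_data are the large in-memory datasets injected
    # by tune.with_parameters below.
    # ... build the model from config and fit on train_data / valid_data ...
    return {"score": 0.0}  # final metric dict (placeholder)

# Hypothetical loader that produces the 5-20 GB in-memory datasets.
train_df = load_my_dataset("train")
valid_df = load_my_dataset("valid")

tuner = tune.Tuner(
    # with_parameters puts the datasets into the Ray object store and
    # injects them into every trial as keyword arguments.
    tune.with_parameters(trainable, train_data=train_df, valid_data=valid_df),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
tuner.fit()
```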

This works fine, but each run transfers the datasets over the network to every node in the cluster. I want to avoid this because it overloads the network and takes a while.

I already have copies of the datasets on each server (a bunch of files on disk), and I want each node to read them from its own local disk. How can I do this in Ray? It is also important not to re-read the files from disk on every trainable step.
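
One workaround I could imagine, sketched below under the assumption that every node keeps an identical copy of the files under the same local path (the path, file format, and trainable body are hypothetical): pass only the path to the trainable and read it once at the start of each trial, so no data crosses the network and the disk is not touched again inside the training loop.

```python
import pandas as pd
from ray import tune

# Hypothetical node-local path; assumed identical on every node.
LOCAL_DATA_PATH = "/data/train.parquet"

def trainable(config):
    # Read from the local disk of whichever node runs this trial.
    # This happens once per trial, not once per training step.
    train_data = pd.read_parquet(config["data_path"])

    for step in range(config["num_steps"]):
        # ... one training step on train_data; no disk reads in here ...
        pass

    return {"score": 0.0}  # final metric dict (placeholder)

tuner = tune.Tuner(
    trainable,
    param_space={
        "data_path": LOCAL_DATA_PATH,
        "num_steps": 10,
        "lr": tune.loguniform(1e-4, 1e-1),
    },
)
tuner.fit()
```

If many trials land on the same node, the read still repeats once per trial; whether it can be cached across trials depends on how Tune reuses worker processes, so I am not sure this fully avoids repeated disk I/O.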

To avoid re-reading the dataset for each trial, you can call Dataset.materialize() on the Ray Datasets before passing them in. This ensures the datasets are materialized in memory, so each trial only receives an in-memory copy rather than reading the data again.
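
A sketch of that suggestion, assuming the data can be loaded as a Ray Dataset (the path and the trainable body are placeholders):

```python
import ray
from ray import tune

# Read the source files and materialize() the dataset so it is fully executed
# and held in the Ray object store before tune.with_parameters captures it.
ds = ray.data.read_parquet("/data/train.parquet").materialize()

def trainable(config, dataset=None):
    # Each trial receives a handle to the already-materialized dataset and
    # iterates over the in-memory blocks instead of re-reading the files.
    for batch in dataset.iter_batches(batch_size=1024):
        pass  # ... training step ...
    return {"score": 0.0}  # final metric dict (placeholder)

tuner = tune.Tuner(
    tune.with_parameters(trainable, dataset=ds),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
tuner.fit()
```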

This does not solve the problem of dataset transfer between servers, though.