I am currently using Ray Tune to search for hyperparameters for my model. My setup is a desktop computer that has enough VRAM to host 4 models, but not enough RAM to store 4 instances of the dataset.
I am trying to use this snippet: Ray Tune FAQ — Ray 2.3.0 so that a single dataset is instantiated and its reference is shared across all the trials running on my PC. Unfortunately, memory usage still grows too much and the script crashes, even though the dataset object is only instantiated once.
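For context, here is roughly the pattern from that FAQ that I am following (load_dataset here is just a stand-in for my own loading code):

from ray import tune
from ray.tune import Tuner

def load_dataset():
    ...  # stand-in for my actual loading code; returns the full in-memory dataset

def train_fn(config, data=None):
    ...  # train one trial using the shared dataset

data = load_dataset()
# with_parameters is supposed to put `data` into the object store once
# and hand each trial a reference instead of a copy
tuner = Tuner(tune.with_parameters(train_fn, data=data))
tuner.fit()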
Ray Data might be what you’re looking for! Ray Data supports pipelined ingestion, i.e. lazy loading and preprocessing of data. You can ingest the data through ray.data.read_csv, ray.data.read_parquet, etc. See Getting Started — Ray 2.3.0 for more info.
This way, you won’t need to store the entire dataset in memory all at once before passing it into training. Under the hood, Ray Data stores dataset blocks in memory across all nodes in your cluster with the Ray Object Store.
Once you’ve configured Ray Data ingest, you can pass in the Ray Dataset to Tune and start using it:
import ray
from ray import tune
from ray.tune import Tuner

train_ds = ray.data.read_parquet(...)  # lazily creates a Ray Dataset; blocks live in the Ray object store

def train_fn(config, train_ds=None):
    # stream batches instead of loading the whole dataset into RAM at once
    for batch in train_ds.iter_batches(batch_size=32):
        pass  # do training on the batch
tuner = Tuner(tune.with_parameters(train_fn, train_ds=train_ds))
tuner.fit()
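Since you mentioned your GPU has room for 4 models, you may also want to tell Tune how much of the GPU each trial needs so that 4 trials can run concurrently. Building on the snippet above, here is a minimal sketch; the 0.25 GPU fraction and num_samples=8 are just placeholders to adjust for your model and search space:

from ray import tune
from ray.tune import Tuner

# assumption: each trial fits in a quarter of the GPU; tune the fraction for your model
trainable = tune.with_resources(
    tune.with_parameters(train_fn, train_ds=train_ds),
    {"cpu": 1, "gpu": 0.25},
)
tuner = Tuner(trainable, tune_config=tune.TuneConfig(num_samples=8))
tuner.fit()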
Let me know if this works for you, and also be sure to give feedback to the Ray Data team as mentioned in the “Feedback Wanted” section on that doc.