Shared dataset on a local desktop

Hello,

I am currently using Ray to tune hyperparameters for my model. My setup is a desktop computer with enough VRAM to host 4 models, but not enough RAM to store 4 copies of the dataset.

I am trying to use this snippet: Ray Tune FAQ — Ray 2.3.0 in order to have a single dataset whose reference is shared across all trials running on my PC. Unfortunately, memory usage still grows too much and the script crashes, even though the dataset object is only instantiated once.
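
For reference, the pattern from that FAQ that I am following looks roughly like this (a minimal sketch with placeholder names, not my actual script):

import numpy as np
from ray import tune
from ray.tune import Tuner

dataset = np.zeros((1_000_000, 64))  # placeholder for my real dataset

def train_fn(config, dataset=None):
    # Each trial should receive a reference to the one object-store copy,
    # not a fresh copy of the dataset.
    pass

# with_parameters puts the dataset into the object store once and passes the
# reference to every trial.
tuner = Tuner(tune.with_parameters(train_fn, dataset=dataset))
tuner.fit()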

You can find my script here: Pastebin.com — the relevant lines are 76, 83, and 38.

What am I doing wrong? Is there another way to achieve what I am describing?

Many thanks!

Hi @lthiet,

Ray Data might be what you’re looking for! Ray Data supports pipelined data ingestion (lazy loading and preprocessing). You can ingest the data through ray.data.read_csv, ray.data.read_parquet, etc. See Getting Started — Ray 2.3.0 for more info.

This way, you won’t need to store the entire dataset in memory all at once before passing it into training. Under the hood, Ray Data stores dataset blocks in memory across all nodes in your cluster with the Ray Object Store.
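
For example, ingest could look roughly like this (the path is a placeholder):

import ray

# Create a Dataset backed by blocks in the Ray object store.
ds = ray.data.read_parquet("s3://my-bucket/my-dataset/")  # placeholder path

# Iterate over batches without materializing the whole dataset in driver memory.
for batch in ds.iter_batches(batch_size=32):
    ...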

You’ll want to use the new Dataset Streaming API introduced in 2.3.0! See here for more info and some snippets you can build on: Developer Preview: Ray Data Streaming Execution - Google Docs
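
As a rough sketch, the preview can be opted into via a DatasetContext flag along these lines (the exact flag name is an assumption based on the developer preview doc, so double-check it there):

from ray.data.context import DatasetContext

# Assumed opt-in for the 2.3.0 streaming execution developer preview.
DatasetContext.get_current().use_streaming_executor = True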

Once you’ve configured Ray Data ingest, you can pass in the Ray Dataset to Tune and start using it:

import ray
from ray import tune
from ray.tune import Tuner

# Read the dataset lazily; blocks live in the Ray object store, not per-trial copies.
train_ds = ray.data.read_parquet(...)

def train_fn(config, train_ds=None):
    # Stream over the dataset batch by batch instead of loading it all at once.
    for batch in train_ds.iter_batches(batch_size=32):
        pass  # do training on the batch

# with_parameters passes the dataset to every trial by reference.
tuner = Tuner(tune.with_parameters(train_fn, train_ds=train_ds))
tuner.fit()

Let me know if this works for you, and also be sure to give feedback to the Ray Data team as mentioned in the “Feedback Wanted” section on that doc.