Shared dataset on a local desktop

Hello,

I am currently using Ray to tune hyperparameters for my model. My setup is a desktop computer with enough VRAM to host 4 models, but not enough RAM to store 4 copies of the dataset.

I am trying to use this snippet: Ray Tune FAQ — Ray 2.3.0 in order to have a single dataset whose reference is shared across all trials running on my PC. Unfortunately, memory usage still grows too much and the script crashes, even though the dataset object is only instantiated once.
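
For reference, the pattern from that FAQ that I am following looks roughly like this (a minimal sketch with placeholder names, not my actual script):

import numpy as np
from ray import tune
from ray.tune import Tuner

dataset = np.zeros((1_000_000, 64))  # placeholder for my real dataset

def train_fn(config, dataset=None):
    # Each trial should receive a reference to the one object-store copy,
    # not a fresh copy of the dataset.
    pass

# with_parameters puts the dataset into the object store once and passes the
# reference to every trial.
tuner = Tuner(tune.with_parameters(train_fn, dataset=dataset))
tuner.fit()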

You can find my script here: Pastebin.com — the relevant lines are 76, 83, and 38.

What am I doing wrong? Is there another way to achieve what I am describing?

Many thanks!

Hi @lthiet,

Ray Data might be what you’re looking for! Ray Data supports pipelined data ingestion (lazy loading and preprocessing). You can ingest the data through ray.data.read_csv, ray.data.read_parquet, etc. See Getting Started — Ray 2.3.0 for more info.

This way, you won’t need to store the entire dataset in memory all at once before passing it into training. Under the hood, Ray Data stores dataset blocks in memory across all nodes in your cluster with the Ray Object Store.
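
For example, ingest could look roughly like this (the path is a placeholder):

import ray

# Create a Dataset backed by blocks in the Ray object store.
ds = ray.data.read_parquet("s3://my-bucket/my-dataset/")  # placeholder path

# Iterate over batches without materializing the whole dataset in driver memory.
for batch in ds.iter_batches(batch_size=32):
    ...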

You’ll want to use the new Dataset Streaming API introduced in 2.3.0! See here for more info and some snippets you can build on: Developer Preview: Ray Data Streaming Execution - Google Docs
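
As a rough sketch, the preview can be opted into via a DatasetContext flag along these lines (the exact flag name is an assumption based on the developer preview doc, so double-check it there):

from ray.data.context import DatasetContext

# Assumed opt-in for the 2.3.0 streaming execution developer preview.
DatasetContext.get_current().use_streaming_executor = True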

Once you’ve configured Ray Data ingest, you can pass in the Ray Dataset to Tune and start using it:

import ray
from ray import tune
from ray.tune import Tuner

# Read the dataset lazily; blocks live in the Ray object store, not per-trial copies.
train_ds = ray.data.read_parquet(...)

def train_fn(config, train_ds=None):
    # Stream over the dataset batch by batch instead of loading it all at once.
    for batch in train_ds.iter_batches(batch_size=32):
        pass  # do training on the batch

# with_parameters passes the dataset to every trial by reference.
tuner = Tuner(tune.with_parameters(train_fn, train_ds=train_ds))
tuner.fit()

Let me know if this works for you, and also be sure to give feedback to the Ray Data team as mentioned in the “Feedback Wanted” section on that doc.