Hi @lthiet,
Ray Data might be what you’re looking for! Ray Data supports pipelined data loading and preprocessing (lazy execution), so data is loaded and transformed incrementally instead of all up front. You can ingest the data through ray.data.read_csv, ray.data.read_parquet, etc. See Getting Started — Ray 2.3.0 for more info.
This way, you won’t need to store the entire dataset in memory all at once before passing it into training. Under the hood, Ray Data stores dataset blocks in memory across all nodes in your cluster with the Ray Object Store.
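For example, a minimal ingest-plus-preprocessing sketch could look like the following (the S3 path and the batch transform are placeholders for your own source and logic, not something from your setup):

```python
import ray

# Hypothetical dataset location; substitute your own files/source.
ds = ray.data.read_parquet("s3://my-bucket/train/")

# Preprocessing runs per batch/block rather than over the whole
# dataset at once; the identity lambda is a stand-in for a real transform.
ds = ds.map_batches(lambda batch: batch)
```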
You’ll want to use the new Dataset Streaming API introduced in 2.3.0! See here for more info and some snippets you can build on: Developer Preview: Ray Data Streaming Execution - Google Docs
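If I’m remembering the developer preview correctly, streaming execution in 2.3.0 is toggled via a context flag along these lines; treat the exact flag name as an assumption and defer to the doc linked above, since preview APIs can change:

```python
from ray.data.context import DatasetContext

# Assumed 2.3.0 preview flag for streaming execution; check the
# linked doc if this has been renamed or made the default.
ctx = DatasetContext.get_current()
ctx.use_streaming_executor = True
```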
Once you’ve configured Ray Data ingest, you can pass the Ray Dataset into Tune and start using it:
```python
import ray
from ray import tune
from ray.tune import Tuner

train_ds = ray.data.read_parquet(...)

def train_fn(config, train_ds=None):
    # Stream over the dataset in fixed-size batches.
    for batch in train_ds.iter_batches(batch_size=32):
        pass  # do training on the batch

tuner = Tuner(tune.with_parameters(train_fn, train_ds=train_ds))
tuner.fit()
```
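One note on why tune.with_parameters is used here: it stores large objects like the dataset in the Ray object store and hands each trial a reference, so the data isn’t serialized into every trial’s config.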
Let me know if this works for you, and also be sure to give feedback to the Ray Data team as mentioned in the “Feedback Wanted” section on that doc.