Hi @lthiet,
Ray Data might be what you’re looking for! Ray Data supports pipelined data loading and preprocessing (lazy execution), so data is loaded and transformed incrementally instead of all up front. You can ingest the data through ray.data.read_csv, ray.data.read_parquet, etc. See Getting Started — Ray 2.3.0 for more info.
This way, you won’t need to store the entire dataset in memory all at once before passing it into training. Under the hood, Ray Data stores dataset blocks in memory across all nodes in your cluster with the Ray Object Store.
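For example, a minimal ingest-plus-preprocessing sketch could look like the following (the S3 path and the batch transform are placeholders for your own source and logic, not something from your setup):

```python
import ray

# Hypothetical dataset location; substitute your own files/source.
ds = ray.data.read_parquet("s3://my-bucket/train/")

# Preprocessing runs per batch/block rather than over the whole
# dataset at once; the identity lambda is a stand-in for a real transform.
ds = ds.map_batches(lambda batch: batch)
```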
You’ll want to use the new Dataset Streaming API introduced in 2.3.0! See here for more info and some snippets you can build on: Developer Preview: Ray Data Streaming Execution - Google Docs
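If I’m remembering the developer preview correctly, streaming execution in 2.3.0 is toggled via a context flag along these lines; treat the exact flag name as an assumption and defer to the doc linked above, since preview APIs can change:

```python
from ray.data.context import DatasetContext

# Assumed 2.3.0 preview flag for streaming execution; check the
# linked doc if this has been renamed or made the default.
ctx = DatasetContext.get_current()
ctx.use_streaming_executor = True
```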
Once you’ve configured Ray Data ingest, you can pass the Ray Dataset into Tune and start using it:
```python
import ray
from ray import tune
from ray.tune import Tuner

train_ds = ray.data.read_parquet(...)

def train_fn(config, train_ds=None):
    # Stream over the dataset in fixed-size batches.
    for batch in train_ds.iter_batches(batch_size=32):
        pass  # do training on the batch

tuner = Tuner(tune.with_parameters(train_fn, train_ds=train_ds))
tuner.fit()
```
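One note on why tune.with_parameters is used here: it stores large objects like the dataset in the Ray object store and hands each trial a reference, so the data isn’t serialized into every trial’s config.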
Let me know if this works for you, and also be sure to give feedback to the Ray Data team as mentioned in the “Feedback Wanted” section on that doc.