Avoid moving datasets around the network when using tune.with_parameters

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I use ray.tune to tune the parameters of my models. I need to pass fairly large datasets (5-20 GB) to the trainable function, and I am currently doing so with tune.with_parameters.
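
A minimal sketch of that setup, assuming a function trainable and a hypothetical `load_my_dataset` loader (names are placeholders, not my actual code):

```python
from ray import tune

def trainable(config, train_data=None, valid_data=None):
    # train_data / valid_data are the large in-memory datasets injected
    # by tune.with_parameters below.
    # ... build the model from config and fit on train_data / valid_data ...
    return {"score": 0.0}  # final metric dict (placeholder)

# Hypothetical loader that produces the 5-20 GB in-memory datasets.
train_df = load_my_dataset("train")
valid_df = load_my_dataset("valid")

tuner = tune.Tuner(
    # with_parameters puts the datasets into the Ray object store and
    # injects them into every trial as keyword arguments.
    tune.with_parameters(trainable, train_data=train_df, valid_data=valid_df),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
tuner.fit()
```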

This works fine, but each run transfers the datasets over the network to every node in the cluster. I want to avoid this because it overloads the network and takes a while.

I already have copies of the datasets on each server (a bunch of files on disk), and I want each node to read them from its own local disk. How can I do this in Ray? It is also important not to re-read the files from disk on every trainable step.
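
One workaround I could imagine, sketched below under the assumption that every node keeps an identical copy of the files under the same local path (the path, file format, and trainable body are hypothetical): pass only the path to the trainable and read it once at the start of each trial, so no data crosses the network and the disk is not touched again inside the training loop.

```python
import pandas as pd
from ray import tune

# Hypothetical node-local path; assumed identical on every node.
LOCAL_DATA_PATH = "/data/train.parquet"

def trainable(config):
    # Read from the local disk of whichever node runs this trial.
    # This happens once per trial, not once per training step.
    train_data = pd.read_parquet(config["data_path"])

    for step in range(config["num_steps"]):
        # ... one training step on train_data; no disk reads in here ...
        pass

    return {"score": 0.0}  # final metric dict (placeholder)

tuner = tune.Tuner(
    trainable,
    param_space={
        "data_path": LOCAL_DATA_PATH,
        "num_steps": 10,
        "lr": tune.loguniform(1e-4, 1e-1),
    },
)
tuner.fit()
```

If many trials land on the same node, the read still repeats once per trial; whether it can be cached across trials depends on how Tune reuses worker processes, so I am not sure this fully avoids repeated disk I/O.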

To avoid re-reading the dataset for each trial, you can call Dataset.materialize() on the Ray Datasets before passing them in. This ensures the datasets are materialized in memory, so each trial only receives an in-memory copy rather than reading the data again.
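
A sketch of that suggestion, assuming the data can be loaded as a Ray Dataset (the path and the trainable body are placeholders):

```python
import ray
from ray import tune

# Read the source files and materialize() the dataset so it is fully executed
# and held in the Ray object store before tune.with_parameters captures it.
ds = ray.data.read_parquet("/data/train.parquet").materialize()

def trainable(config, dataset=None):
    # Each trial receives a handle to the already-materialized dataset and
    # iterates over the in-memory blocks instead of re-reading the files.
    for batch in dataset.iter_batches(batch_size=1024):
        pass  # ... training step ...
    return {"score": 0.0}  # final metric dict (placeholder)

tuner = tune.Tuner(
    tune.with_parameters(trainable, dataset=ds),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
tuner.fit()
```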

This does not solve the problem of dataset transfer between servers, though.