Reuse data in Ray Tune

Hello, I am using Ray Tune for hyperparameter optimization of XGBoost. All my experiments run on the same RayDMatrix input data, which wraps a large number of Parquet files. I noticed that every experiment reloads the data from scratch. Is there a way to have Tune load the data once and reuse it across experiments?

Hi,

I was able to add the path to a common data location as a parameter to the objective function.

def obj_func(config, data_dir="./data"):
    # config, training, and evaluation code here

To the Tuner object, I passed obj_func wrapped with tune.with_parameters.

tune.Tuner(tune.with_parameters(obj_func, data_dir="PATH_TO_COMMON_DATA"),
           .....)

I haven’t tested it with the new TorchTrainer API, but it worked with the Ray 2.6 code.
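
Putting the snippets together, a rough sketch of the whole pattern (load_datasets is a hypothetical helper and the search space is just a placeholder):

from ray import tune

def obj_func(config, data_dir="./data"):
    # load_datasets is a hypothetical helper: each trial still reads the
    # Parquet files from the shared directory on its own.
    dtrain, deval = load_datasets(data_dir)
    # ... train XGBoost with `config` and report metrics to Tune ...

tuner = tune.Tuner(
    tune.with_parameters(obj_func, data_dir="PATH_TO_COMMON_DATA"),
    param_space={"max_depth": tune.randint(2, 10)},
)
results = tuner.fit()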

Hi, thanks for the reply.

I don’t think the actors will share data this way; they will each still load the data separately. What I want is for the data to be loaded into the object store only once, with all trials then using that in-memory copy.

You’re right, the actors would not be able to share the data in memory this way. I only used the common data directory to save the space and time of downloading the data to each trial; each actor still has to load and process the data itself.
Actors are separate processes with their own memory allocation, and I don’t know if there is a simple way (without extra resource management) to give them common memory access.

Thanks @f2010126 for the suggestion! You can actually use tune.with_parameters to do exactly that – share a single in-memory copy of the data in the Ray object store between all Ray Tune trial actors.

@mlts Take a look here for an example of how to do this: Ray Tune FAQ — Ray 2.7.0
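
For reference, the pattern from that FAQ entry amounts to loading the data once in the driver and passing it through tune.with_parameters, which puts it in the Ray object store so every trial gets a handle to the same copy. A minimal sketch (load_parquet_once and the search space are illustrative placeholders, not taken from the FAQ):

from ray import tune

def obj_func(config, data=None):
    # `data` is resolved from the single object-store copy that all
    # trials share, so there is no per-trial reload from Parquet.
    # ... train XGBoost on `data` with `config` and report metrics ...
    ...

data = load_parquet_once()  # hypothetical loader; runs once in the driver

tuner = tune.Tuner(
    # with_parameters stores `data` in the object store (via ray.put) and
    # hands each trial actor a reference to that one copy.
    tune.with_parameters(obj_func, data=data),
    param_space={"eta": tune.loguniform(1e-4, 1e-1)},
)
results = tuner.fit()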