Hello, I am using Ray Tune to do hyperparameter optimization of XGBoost. All my experiments run on the same RayDMatrix input data, which wraps a large number of Parquet files. I noticed that every experiment reloads the data from scratch. Is there a way to let Tune load the data once and reuse it across experiments?
Hi,
I was able to add the path to a common data location as a parameter to the objective function.
def obj_func(config, data_dir="./data"):
    # training and evaluation code here
To the Tuner object, I passed obj_func wrapped with tune.with_parameters:
tune.Tuner(
    tune.with_parameters(obj_func, data_dir="PATH_TO_COMMON_DATA"),
    ...,
)
I haven't tested it with the new TorchTrainer API, but it worked with Ray 2.6.
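For reference, here is a minimal sketch of that pattern, simplified to plain pandas/XGBoost rather than RayDMatrix; the "label" column, metric, number of boosting rounds, and search space are placeholders I made up:

import pandas as pd
import xgboost as xgb
from ray import train, tune


def obj_func(config, data_dir="./data"):
    # Every trial still reads the Parquet files itself, but from the shared location.
    data = pd.read_parquet(data_dir)
    dtrain = xgb.DMatrix(data.drop(columns=["label"]), label=data["label"])
    results = {}
    xgb.train(config, dtrain, num_boost_round=10,
              evals=[(dtrain, "train")], evals_result=results)
    # Ray 2.7-style metric reporting.
    train.report({"train-logloss": results["train"]["logloss"][-1]})


tuner = tune.Tuner(
    tune.with_parameters(obj_func, data_dir="PATH_TO_COMMON_DATA"),
    param_space={
        "objective": "binary:logistic",
        "eval_metric": ["logloss"],
        "eta": tune.loguniform(1e-4, 1e-1),
    },
)
tuner.fit()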
Hi, thanks for the reply.
I don't think the actors will share data this way. They will still load the data separately by themselves. What I want is for the data to be loaded into the object store only once, and then have all trials use that in-memory copy.
You're right, the actors would not share the data in-memory this way. I only used the data dir to save the space and time of downloading the data to a common location; each actor still has to load and process the data itself.
Actors are separate processes with their own memory allocation. I don't know if simple shared-memory access (without extra resource management) is possible.
Thanks @f2010126 for the suggestion! You can actually use tune.with_parameters to do exactly that: share a single in-memory, object-store copy of the data between all Ray Tune trial actors.
@mlts Take a look here for an example of how to do this: Ray Tune FAQ — Ray 2.7.0
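For anyone landing here later, a rough sketch of what the FAQ describes, assuming the Parquet data fits in the object store and is loaded into a pandas DataFrame on the driver; the "label" column, metric, and search space below are placeholders:

import pandas as pd
import xgboost as xgb
from ray import train, tune


def train_xgb(config, data=None):
    # `data` is resolved from the Ray object store inside each trial,
    # so no trial re-reads the Parquet files.
    dtrain = xgb.DMatrix(data.drop(columns=["label"]), label=data["label"])
    results = {}
    xgb.train(config, dtrain, num_boost_round=10,
              evals=[(dtrain, "train")], evals_result=results)
    train.report({"train-logloss": results["train"]["logloss"][-1]})


# Load the Parquet files exactly once, in the driver process.
data = pd.read_parquet("PATH_TO_COMMON_DATA")

tuner = tune.Tuner(
    # with_parameters puts `data` into the object store once and hands
    # every trial a reference to that single shared copy.
    tune.with_parameters(train_xgb, data=data),
    param_space={
        "objective": "binary:logistic",
        "eval_metric": ["logloss"],
        "max_depth": tune.randint(2, 10),
        "eta": tune.loguniform(1e-4, 1e-1),
    },
)
tuner.fit()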