Loading a dataset once per machine in a Ray cluster

I’m using Ray to train on a large dataset (~200 GB) across multiple GPUs (5 physical machines with 6 GPUs each). To optimise memory usage, I want to load the dataset once per machine rather than once per GPU, and make it available in memory to every actor on that machine, ideally via a plain ray.get(ObjectRef).

The approach I’m currently exploring is to instantiate 5 DatasetActors in a placement group with the STRICT_PACK strategy (does that send one to each machine, or is STRICT_SPREAD what I want here?). Each DatasetActor calls ray.put(dataset_object), which gives us back one reference per machine. A sketch of what I have so far is below.
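Here's my rough sketch (just a draft, not working code). I've gone with STRICT_SPREAD since my understanding is that it forces each bundle onto a different node, and I've sized each bundle at 1 CPU + 6 GPUs so the TrainActors can later share the same bundles. `load_my_dataset()` is a hypothetical stand-in for my real loading code:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

# One bundle per machine; STRICT_SPREAD should force each bundle onto a
# different node. Each bundle reserves 1 CPU for the DatasetActor plus the
# 6 GPUs that the TrainActors on that machine will use.
pg = placement_group([{"CPU": 1, "GPU": 6}] * 5, strategy="STRICT_SPREAD")
ray.get(pg.ready())


@ray.remote(num_cpus=1)
class DatasetActor:
    def load(self):
        dataset = load_my_dataset()  # hypothetical stand-in for my real loader
        # ray.put stores the dataset in this node's local object store
        # (shared memory) and returns an ObjectRef to it.
        return ray.put(dataset)


dataset_actors = [
    DatasetActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=i,  # pin actor i to bundle/machine i
        )
    ).remote()
    for i in range(5)
]

# One ObjectRef per machine, each pointing at that node's copy of the dataset.
dataset_refs = ray.get([actor.load.remote() for actor in dataset_actors])
```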

How can I now point each TrainActor at the dataset_object for its own machine, so that it loads it from local shared memory?
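Here's the rough shape I'm picturing for the training side, continuing the sketch above (TrainActor and its train() method are placeholders). One thing I believe matters here: Ray auto-resolves ObjectRefs passed as top-level arguments, so I wrap the ref in a list to keep hold of the ref itself and call ray.get explicitly inside the actor:

```python
@ray.remote(num_gpus=1)
class TrainActor:
    def __init__(self, dataset_ref_list):
        # The ref arrives wrapped in a list so Ray passes the ObjectRef
        # itself instead of auto-resolving it at call time. ray.get then
        # reads from this node's shared-memory object store (zero-copy
        # for numpy-backed data).
        self.dataset = ray.get(dataset_ref_list[0])

    def train(self):
        ...  # placeholder for the actual training loop


train_actors = []
for machine in range(5):
    for _ in range(6):  # 6 GPUs per machine -> 6 TrainActors per bundle
        actor = TrainActor.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_bundle_index=machine,  # same bundle as that machine's DatasetActor
            )
        ).remote([dataset_refs[machine]])  # wrap the ref so it isn't auto-resolved
        train_actors.append(actor)

ray.get([actor.train.remote() for actor in train_actors])
```

One related thing I'm unsure about: since each ref comes from ray.put inside a DatasetActor, I believe the object goes away if its owning actor dies, so I'd plan to keep the DatasetActors alive for the whole run. Is that right?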

Also, if there is a better/more elegant approach, I’m all ears :slight_smile:

Anyone have any ideas? Cheers