I’m using Ray to train on a large dataset (~200GB) across multiple GPUs (5 physical machines with 6 GPUs each, so 30 GPUs total). To optimise memory usage, I want to load the dataset once per machine rather than once per GPU, and make it available in memory to every actor on that machine, ideally by just calling ray.get(ObjectRef).
The approach I’m currently exploring is to instantiate 5 “DatasetActors” inside a placement group with one bundle per machine. I initially reached for “STRICT_PACK”, but as far as I can tell that packs every bundle onto a single node, and “STRICT_SPREAD” is what forces one bundle per node; is that right? Each DatasetActor calls ray.put(dataset_object) on its own node and returns the resulting ObjectRef, so I end up with one reference per machine.
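Here’s roughly what I have so far, as a minimal sketch (load_my_dataset is a placeholder for my actual loading code, and the bundle sizes are just guesses for one DatasetActor plus six 1-GPU TrainActors per node):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

NUM_NODES = 5
GPUS_PER_NODE = 6

# One bundle per machine. STRICT_SPREAD forces each bundle onto a
# different node (STRICT_PACK would do the opposite: all on one node).
pg = placement_group(
    [{"CPU": 1 + GPUS_PER_NODE, "GPU": GPUS_PER_NODE}] * NUM_NODES,
    strategy="STRICT_SPREAD",
)
ray.get(pg.ready())

@ray.remote(num_cpus=1)
class DatasetActor:
    def load(self):
        dataset_object = load_my_dataset()  # placeholder for the real ~200GB load
        # Put the dataset into this node's object store and hand back the ref.
        return ray.put(dataset_object)

dataset_actors = [
    DatasetActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=node
        )
    ).remote()
    for node in range(NUM_NODES)
]

# ray.get on each method's return ref yields the *inner* ObjectRef,
# so this gives one dataset reference per machine.
dataset_refs = ray.get([a.load.remote() for a in dataset_actors])
```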
How can I now get each TrainActor the ObjectRef for its own machine’s dataset_object, so that its ray.get is served from the node-local shared-memory object store?
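This is roughly what I imagine on the training side (TrainActor is a placeholder; I’m assuming that pinning each TrainActor to the same bundle as its node’s DatasetActor and handing it that node’s ref makes the ray.get a local shared-memory read):

```python
@ray.remote(num_cpus=1, num_gpus=1)
class TrainActor:
    def __init__(self, dataset_ref_wrapper):
        # Wrapping the ref in a list keeps Ray from auto-resolving it as a
        # top-level argument, so the actor receives the ObjectRef itself.
        dataset_ref = dataset_ref_wrapper[0]
        # The object lives in this node's plasma store, so this ray.get is a
        # local shared-memory read (zero-copy for numpy-backed data).
        self.dataset = ray.get(dataset_ref)

    def train(self):
        ...

train_actors = [
    TrainActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=node
        )
    ).remote([dataset_refs[node]])
    for node in range(NUM_NODES)
    for _ in range(GPUS_PER_NODE)  # 6 TrainActors per machine, one per GPU
]
```

One thing I’m unsure about: since each object is owned by the DatasetActor that called ray.put, I assume those actors have to stay alive for the whole training run, or the refs become unavailable?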
Also, if there is a better or more elegant approach, I’m all ears.