I am implementing a deep learning method that needs to evaluate individual gradients w.r.t. the inputs in a batch. This would be much faster with multiprocessing, since the individual evaluations don’t depend on each other and can (in theory) be executed in parallel on the GPU. As far as I’m aware, this is possible using Ray remote functions with a fractional GPU argument. In addition, I want to use distributed training via DistributedDataParallel, and for that I use the Ray Trainer.
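To make the first part concrete, this is roughly the kind of fractional-GPU remote function I have in mind (just a sketch; `compute_input_grad` and `build_model` are placeholder names of my own, not from any library):

```python
import ray

@ray.remote(num_gpus=0.25)  # e.g. four such tasks can share one physical GPU
def compute_input_grad(state_dict, x):
    # Ray sets CUDA_VISIBLE_DEVICES for the task, so .cuda() lands on the
    # GPU (fraction) assigned to this task.
    model = build_model()               # placeholder factory for my network
    model.load_state_dict(state_dict)
    model.cuda()

    x = x.cuda().requires_grad_(True)   # track the gradient w.r.t. this input
    out = model(x.unsqueeze(0)).sum()   # forward pass on a single sample
    out.backward()
    return x.grad.cpu()

# one task per sample in the batch, results gathered with ray.get:
# grads = ray.get([compute_input_grad.remote(state_dict, x) for x in batch])
```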
Now, the standard practice for the Trainer would be to set num_workers to the number of GPUs, so that every worker gets 1 CPU and 1 GPU reserved. However, I still need the multiprocessing in order not to waste my GPUs on individual gradient computations (effectively batch size 1). When I use remote functions for this, new Ray workers are created with DIFFERENT resources than the Trainer’s workers.
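For reference, this is the standard Trainer setup I mean (a sketch based on the `ray.train.Trainer` API; details may differ across Ray versions, and my actual train_func is omitted):

```python
import ray
from ray.train import Trainer

def train_func():
    # regular DDP training loop, one copy per Train worker; this is also
    # where I would like to launch the per-sample gradient tasks from above
    ...

ray.init()
# one Train worker per GPU, each reserving 1 CPU and 1 GPU
trainer = Trainer(backend="torch", num_workers=1, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```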
E.g., my system has 1 GPU (this should still work): my Trainer worker reserves that GPU, so it can’t be used by the new workers for the multiprocessing. Ideally, the multiprocessing workers would use the entire GPU, and once they are done, the Trainer worker would use the entire GPU for the standard “.backward()” call on the loss.
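Put together, this is the pattern that runs into the resource conflict on a single-GPU machine (again only a sketch, reusing the placeholder names from above plus a hypothetical compute_loss and loader):

```python
import ray
import torch

def train_func():
    # Ray Train (backend="torch") has already set up the process group here
    model = build_model().cuda()
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
    for inputs, targets in loader:
        state_dict = {k: v.cpu() for k, v in model.state_dict().items()}
        # Each task asks for a fraction of a GPU, but the Trainer worker has
        # already reserved the whole GPU, so these tasks stay pending.
        input_grads = ray.get(
            [compute_input_grad.remote(state_dict, x) for x in inputs]
        )
        loss = compute_loss(ddp_model(inputs.cuda()), targets, input_grads)
        loss.backward()  # the standard DDP backward on the loss
```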
How do I go about this in the correct way? I feel like using a Trainer for distributed training and remote functions for multiprocessing at the same time is not how these features are intended to be used, but I need both functionalities and want to make optimal use of my GPUs. Should I even use remote functions, or something like ray.util.multiprocessing.Pool? Would creating more workers in the Trainer handle the multiprocessing automatically? And how would Ray know which portions of my code benefit from a multiprocessing speedup and should be executed in parallel?