Parallel inference using CPUs

Main question

I have a trained model. I have one GPU node with 4 GPUs and 40 CPUs. I wish to apply the model to test_data in parallel across the 40 CPUs (so each CPU gets 1/40 of the test_data). How can I do this?
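To make the "each CPU gets 1/40" split concrete, here is a minimal, framework-independent sketch of how sample indices could be divided into near-equal contiguous shards. The helper name `shard_indices` is hypothetical and not part of PyTorch or Ray. Each resulting range could, for example, be passed as the `indices` argument of `torch.utils.data.Subset` to build a per-worker dataset.

```python
def shard_indices(num_samples: int, num_workers: int) -> list[range]:
    """Split range(num_samples) into num_workers contiguous, near-equal shards.

    Hypothetical helper for illustration; not a PyTorch or Ray API.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(num_samples, num_workers)
    shards = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional sample each.
        size = base + (1 if worker < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


shards = shard_indices(num_samples=1000, num_workers=40)
print(len(shards), len(shards[0]))  # 40 25
```

When the sample count is not divisible by the worker count, the leftover samples go to the first few workers, so shard sizes differ by at most one.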

More details
I would like to avoid using Ray AIR if possible for two reasons:

  1. It is in beta testing.
  2. I would need to convert my PyTorch DataLoader to a Ray AIR Dataset. The GitHub issues page says that a tutorial on this is planned but not yet written, so I don’t know how to do this.

From googling this question, I see a lot of questions about parallelizing over the 4 GPUs, but since I have 40 CPUs, I think parallelizing over CPUs instead of GPUs would be faster. I am using PyTorch.

Skeleton code that sets up the dataset, dataloader, and model is provided below.

import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

my_dataset = Dataset(...)
my_loader = DataLoader(my_dataset, ...)

state_dict = torch.load(model_save_location)

device = torch.device('cpu')
model = ...


  1. I am aware that I can use Pool from torch.multiprocessing (as detailed here). However, I would prefer to use Ray: if I want to scale to multiple nodes in the future (I only have 1 GPU node now), I think that would be much easier with Ray than without.
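For comparison, the pool-based approach can be sketched with the standard library alone. This is only an illustration under stated assumptions: it uses `concurrent.futures` rather than `torch.multiprocessing` or Ray, and `square_batch` is a hypothetical stand-in for a model's forward pass. In a real PyTorch setup, each worker would presumably load the model itself, and limiting intra-op threads per worker (for example with `torch.set_num_threads(1)`) may help avoid oversubscribing the CPUs.

```python
from concurrent.futures import ProcessPoolExecutor


def square_batch(batch):
    """Hypothetical stand-in for running a model on one shard of inputs."""
    return [x * x for x in batch]


def run_sharded(predict_fn, data, num_workers, executor_cls=ProcessPoolExecutor):
    """Split `data` into up to `num_workers` contiguous shards and map
    `predict_fn` over them in parallel, returning outputs in input order.

    `executor_cls` is a parameter so a different executor can be swapped in.
    """
    # Ceiling division; at least 1 so an empty input yields no shards.
    shard_size = max(1, -(-len(data) // num_workers))
    shards = [data[i:i + shard_size] for i in range(0, len(data), shard_size)]
    with executor_cls(max_workers=num_workers) as pool:
        # Executor.map yields results in the order of the submitted shards.
        results = pool.map(predict_fn, shards)
        return [y for shard_out in results for y in shard_out]
```

With the default process pool, a call such as `run_sharded(square_batch, list(range(8)), num_workers=4)` should be made under an `if __name__ == "__main__":` guard, since some platforms start workers by re-importing the main module. The function passed in must be defined at module level so that it can be pickled.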

@xzf0kgb0bqr.cev2RWU we recently added the guide to move from Torch Datasets/DataLoader to Ray Datasets! Working with PyTorch - Ray 3.0.0.dev0