Prefetch data to GPU in `map_batches`

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

When doing inference with a PyTorch model in Ray Data, I often use the following pattern:

import torch

class InferenceActor:
    ...

    def __call__(self, data):
        # host-to-device transfer happens at the start of every call
        x = torch.from_numpy(data["x"]).to(self.device)
        y = self.model(x)
        return {"y": y.cpu().numpy()}

pipe = (
    ...
    .map(data_load_fn)
    .map_batches(InferenceActor, num_gpus=1)
)

However, I can’t get good GPU utilisation out of this, especially for small models, because of the host-to-device transfer at the start of the actor’s __call__ method. Is there some way to overlap inference with those memory transfers? I found a workaround by simply allocating multiple copies of the same actor on a single GPU (sketched below), but that only works if the model itself isn’t too big.
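
For reference, the workaround looks roughly like this (just a sketch: the GPU fraction and concurrency value are illustrative, and ds stands for the dataset source elided above):

# Workaround sketch: two copies of the actor share one GPU via fractional
# GPU requests, so one copy can run inference while the other performs its
# host-to-device transfer. Only viable while both model copies fit in GPU memory.
pipe = (
    ds
    .map(data_load_fn)
    .map_batches(
        InferenceActor,
        num_gpus=0.5,   # each actor copy requests half a GPU
        concurrency=2,  # two actor copies, typically packed onto the same GPU
    )
)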

I think what you want for this case is the prefetch_batches arg on our iter_batches() APIs, sketched below. You can read more details here: ray.data.Dataset.iter_torch_batches — Ray 2.34.0
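
In case a concrete sketch helps, something along these lines (prefetch_batches, batch_size, and the device are illustrative; ds stands for your dataset after the map(data_load_fn) step):

import torch

# Sketch: iterate over GPU-ready batches; Ray prefetches and transfers the
# next batches while the current one is being processed by the model.
model = ...  # your torch model, already moved to "cuda"

for batch in ds.iter_torch_batches(
    batch_size=256,       # illustrative
    prefetch_batches=2,   # overlap loading/transfer with inference
    device="cuda",        # batches arrive already on the GPU
    dtypes=torch.float32,
):
    with torch.no_grad():
        y = model(batch["x"])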

Thanks for your reply! Yeah, that’s where the question comes from: I’ve used iter_torch_batches in conjunction with ray.train. Do you have an example where it’s used similarly to my example above? In the case of single-GPU inference it’s pretty easy, but if we set concurrency > 1 on map_batches in the example above, I don’t see how I could replicate that with iter_torch_batches without re-implementing a lot of functionality.

In this case, you can pass the max_concurrency Ray remote arg to map_batches(). For example, map_batches(..., max_concurrency=2) will prefetch one extra batch (one concurrent call holds the current batch while another already holds the next batch).
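
Roughly like this (a sketch; the value 2 is illustrative, and ds again stands for the elided dataset source):

# Sketch: max_concurrency is passed through as a Ray remote arg, so each
# actor can be handed the next batch while it is still processing the
# current one, overlapping the host-to-device copy with inference.
pipe = (
    ds
    .map(data_load_fn)
    .map_batches(
        InferenceActor,
        num_gpus=1,
        max_concurrency=2,  # one extra in-flight batch per actor
    )
)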