How severely does this issue affect your experience of using Ray?
Medium: It adds significant difficulty to completing my task, but I can work around it.
When doing inference with a PyTorch model in Ray Data, I often use the following pattern:
import torch

class InferenceActor:
    def __init__(self):
        ...  # model setup elided: loads the model as self.model and sets self.device

    def __call__(self, data):
        # Copy the batch host -> device, run the model, copy the result back.
        x = torch.from_numpy(data["x"]).to(self.device)
        with torch.no_grad():
            y = self.model(x)
        return {"y": y.cpu().numpy()}

pipe = (
    ...
    .map(data_load_fn)
    .map_batches(InferenceActor, num_gpus=1)
)
However, I can't get good GPU utilisation out of this, especially for small models, because of the host-to-device transfer at the start of the actor's __call__ method. Is there some way to overlap inference with those memory transfers? I found a workaround by simply allocating multiple copies of the same actor on a single GPU, but that only works if the model itself isn't too big.
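Concretely, the workaround looks roughly like this (a sketch only, assuming a Ray Data version where map_batches takes a concurrency argument; the fractional num_gpus value is what lets both actor copies land on the same physical GPU):

pipe = (
    ...
    .map(data_load_fn)
    .map_batches(
        InferenceActor,
        num_gpus=0.5,   # each copy reserves half a GPU...
        concurrency=2,  # ...so two copies of the model share one physical GPU
    )
)

This keeps the GPU busier because one copy can run inference while the other is doing its host-to-device copy, but it only works if the model weights fit in GPU memory twice.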
Thanks for your reply! Yeah, that's where the question comes from: I've used iter_torch_batches in conjunction with ray.train. Do you have an example where it's used similarly to my example above? In the single-GPU inference case it's pretty easy, but if we set concurrency > 1 on map_batches in the example above, I don't see how I could replicate that with iter_torch_batches without re-implementing a lot of functionality.
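For reference, the single-GPU case I mean is roughly this (a minimal sketch, assuming ds is the dataset after .map(data_load_fn) and that the model is already on the GPU; iter_torch_batches moves each batch to the given device, and prefetch_batches lets the next transfer overlap with the current forward pass):

import torch

with torch.no_grad():
    for batch in ds.iter_torch_batches(
        batch_size=256,
        device="cuda",
        prefetch_batches=2,  # overlap the next host-to-device copy with inference
    ):
        y = model(batch["x"])
        # ... consume y.cpu() ...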
In this case, you can pass the max_concurrency Ray remote arg to map_batches(). For example, map_batches(..., max_concurrency=2) will prefetch one extra batch (one actor will have the current batch, one actor will have the next batch).
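Applied to your pipeline, that would look something like this (a sketch, not tested; max_concurrency is forwarded to the underlying actor as a Ray remote arg, which makes it a threaded actor, so __call__ needs to be safe to run from two threads at once):

pipe = (
    ...
    .map(data_load_fn)
    .map_batches(
        InferenceActor,
        num_gpus=1,
        max_concurrency=2,  # allow 2 concurrent __call__ invocations on the actor
    )
)

With two calls in flight, the host-to-device copy for the next batch can overlap with inference on the current one.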