I am running multi-GPU inference with map_batches, but I am having trouble understanding why the data is not processed evenly across workers and why block completion stalls as inference proceeds. I have experimented with different repartition sizes, and I always end up with one GPU that is effectively unused; its memory usage is consistent with the model having been loaded and nothing more.
import ray

# convert the pandas DataFrame to a Ray Dataset and repartition into 396 blocks
ray_ds = ray.data.from_pandas(df).repartition(396)
# row-wise preprocessing
ray_ds = ray_ds.map(preprocess_function)
# batched inference: one GPU per HuggingFacePredictor replica, args.num_devices replicas
predictions = ray_ds.map_batches(HuggingFacePredictor, num_gpus=1, batch_size=args.batch_size, concurrency=args.num_devices)
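For reference, HuggingFacePredictor is a stateful callable class along the lines of the sketch below (simplified, not my exact class; the model name and the "text"/"prediction" column names are placeholders, and it assumes the default dict-of-NumPy batch format):

from typing import Dict

import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class HuggingFacePredictor:
    # Constructed once per replica; num_gpus=1 pins each replica to one GPU.
    def __init__(self):
        model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name).to("cuda")
        self.model.eval()

    # Called once per batch of batch_size rows.
    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        inputs = self.tokenizer(
            [str(t) for t in batch["text"]], padding=True, truncation=True, return_tensors="pt"
        ).to("cuda")
        with torch.no_grad():
            logits = self.model(**inputs).logits
        batch["prediction"] = logits.argmax(dim=-1).cpu().numpy()
        return batch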
Here is the output. The object store memory in use decreases over time, but block progress hangs at 1/396:
Running: 0.0/384.0 CPU, 16.0/16.0 GPU, 29.59 GiB/282.72 GiB object_store_memory: 0%| | 1/396 [1:56:33<752:37:05, 6859.31s/it]
Running: 0.0/384.0 CPU, 16.0/16.0 GPU, 29.52 GiB/282.72 GiB object_store_memory: 0%| | 1/396 [1:56:33<752:37:05, 6859.31s/it]
I am also unclear why one of my GPUs is poorly saturated:
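Would something like the following be a reasonable way to verify which physical GPU each map_batches replica is actually assigned? (GpuProbe here is just a throwaway diagnostic class I would run separately, not part of my actual pipeline.)

import ray
import torch

class GpuProbe:
    # One instance per replica; prints the GPU Ray assigned to this actor.
    def __init__(self):
        print(f"ray.get_gpu_ids()={ray.get_gpu_ids()}, "
              f"torch sees {torch.cuda.device_count()} CUDA device(s)")

    # Pass batches through unchanged; only the per-replica init log matters.
    def __call__(self, batch):
        return batch

ray.data.range(1000).map_batches(GpuProbe, num_gpus=1, batch_size=128, concurrency=args.num_devices).materialize()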