Ray 2.9.3: map_batches and multi-GPU -- partition blocks not processing / data not sharded evenly

I am running multi-GPU inference with map_batches, but I'm having difficulty understanding why the operation is not processing data evenly or making progress on block completion as inference proceeds. I have tried varying partition sizes, and I always end up with one GPU that is effectively unused; its memory usage is consistent with just loading the model.

import ray

# convert the pandas DataFrame to a Ray Dataset and split it into 396 blocks
ray_ds = ray.data.from_pandas(df).repartition(396)

# row-level preprocessing
ray_ds = ray_ds.map(preprocess_function)

# batched inference: one GPU per actor, args.num_devices actors in the pool
predictions = ray_ds.map_batches(
    HuggingFacePredictor,
    num_gpus=1,
    batch_size=args.batch_size,
    concurrency=args.num_devices,
)
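
For context, HuggingFacePredictor follows the usual stateful-callable pattern for map_batches: __init__ loads the model once per actor and __call__ runs on each batch. A simplified sketch (the model, task, and column names below are placeholders, not my actual code):

# simplified sketch of the predictor class (placeholder model/task/columns)
import numpy as np
from transformers import pipeline

class HuggingFacePredictor:
    def __init__(self):
        # runs once per actor: load the model onto that actor's GPU
        # (with num_gpus=1, Ray pins each actor to a single GPU)
        self.pipe = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0,
        )

    def __call__(self, batch: dict) -> dict:
        # called once per batch; batches arrive as dicts of numpy arrays by default
        outputs = self.pipe(batch["text"].tolist(), batch_size=len(batch["text"]))
        batch["label"] = np.array([o["label"] for o in outputs])
        return batch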

Here is the output; the object store memory in use slowly decreases over time, but block progress hangs at 1/396:

Running: 0.0/384.0 CPU, 16.0/16.0 GPU, 29.59 GiB/282.72 GiB object_store_memory:   0%|          | 1/396 [1:56:33<752:37:05, 6859.31s/it]
Running: 0.0/384.0 CPU, 16.0/16.0 GPU, 29.52 GiB/282.72 GiB object_store_memory:   0%|          | 1/396 [1:56:33<752:37:05, 6859.31s/it]

I am also unclear why one of my GPUs is so poorly saturated.

I downgraded to Ray 2.7.0 and the issue does not occur there.

Thanks for reporting this, localh; can you create a GitHub issue on the Ray repo and attach a repro script? I'll get our relevant on-calls to take a look.
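
A skeleton along these lines is usually enough for us to reproduce scheduling issues; dummy data and a trivial placeholder callable stand in for your real preprocessing and model, and the row count, batch size, and 16-GPU concurrency below are just placeholders matching your setup:

# hypothetical repro skeleton: dummy data, dummy work, same map_batches shape
import numpy as np
import pandas as pd
import ray

ray.init()

df = pd.DataFrame({"text": ["some example text"] * 100_000})  # placeholder data
ds = ray.data.from_pandas(df).repartition(396)

class DummyPredictor:
    def __call__(self, batch: dict) -> dict:
        # stand-in for real inference so the data-plane behaviour is isolated
        batch["pred"] = np.array([len(t) for t in batch["text"]])
        return batch

preds = ds.map_batches(DummyPredictor, num_gpus=1, batch_size=256, concurrency=16)
preds.materialize()  # force execution so the 1/396 stall can be observed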