Single node, 4x GPU, map_batches only using 1

How can I get map_batches to take advantage of all GPUs on a single node? nvidia-smi only ever shows 1 GPU in use, no matter what I change.

# get preds
predictions = test_ds_tokenized.map_batches(
    TorchPredictor(model=model),
    num_gpus=1,  # GPUs reserved per map worker, not in total
    batch_size=32,
    compute=ray.data.ActorPoolStrategy(size=4),  # one actor per GPU on the node
).materialize()
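Side note: Ray Data also accepts the callable class itself (constructed once per actor) instead of a pre-built instance shipped from the driver. A hedged sketch of that variant, using fn_constructor_kwargs to forward the model argument; this is not the exact code from the guide linked below:

predictions = test_ds_tokenized.map_batches(
    TorchPredictor,                             # the class; instantiated once per actor
    fn_constructor_kwargs={"model": model},     # forwarded to TorchPredictor.__init__
    num_gpus=1,                                 # still per actor
    batch_size=32,
    compute=ray.data.ActorPoolStrategy(size=4),
).materialize()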

My main class, following "End-to-end: Offline Batch Inference — Ray 2.7.0":

import torch

class TorchPredictor:
    def __init__(self, model):
        self.model = model.cuda()
        self.model.eval()

    def __call__(self, batch):
        # convert numpy batches to tensors and move them onto the GPU
        batch["input_ids"] = torch.as_tensor(batch["input_ids"], dtype=torch.int64, device="cuda")
        batch["attention_mask"] = torch.as_tensor(batch["attention_mask"], dtype=torch.int64, device="cuda")

        # inference_mode: like no_grad, with slightly less overhead
        with torch.inference_mode():
            # forward pass
            out = self.model.generate(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=750,
                do_sample=False,
            )

            # decode token IDs back to strings (tokenizer is a module-level object defined elsewhere)
            out = tokenizer.batch_decode(out, skip_special_tokens=True)

            return {"y_pred": out}
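One way to confirm that each map worker really gets its own device is to log the GPU IDs Ray assigns from inside __call__. The TorchPredictorDebug wrapper below is a hypothetical debugging aid, not part of the original code:

import os
import ray

class TorchPredictorDebug(TorchPredictor):
    """Hypothetical wrapper that logs which GPU each map worker sees."""

    def __call__(self, batch):
        # Ray sets CUDA_VISIBLE_DEVICES per worker according to num_gpus, so an
        # actor pool of 4 should report 4 distinct GPU IDs across its logs.
        print("Ray GPU IDs:", ray.get_gpu_ids(),
              "CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
        return super().__call__(batch)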

Hoping to bump this again.

Here is what is still going on in the logs: Ray claims to start using 4 GPUs but then drops down to 1.

Running: 0.0/96.0 CPU, 4.0/4.0 GPU, 0.0 MiB/95.2 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
Running: 0.0/96.0 CPU, 1.0/4.0 GPU, 335.95 MiB/95.2 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]

Based on this, I think what’s going on is that your dataset has only 1 block (parallelism=1). This is a bit unusual, since Ray Data automatically sets the parallelism to at least 2 * num_cores by default.
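One way to check this, assuming Ray 2.7 where Dataset.num_blocks() is available: if the materialized dataset reports a single block, only one map actor can be fed at a time, which matches the single busy GPU.

# How many blocks does the tokenized dataset actually have?
print("num blocks:", test_ds_tokenized.materialize().num_blocks())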

A couple things I would try:

  • upgrade to Ray 2.7, which should do a better job of autodetecting the parallelism when reading data
  • set the kwarg parallelism=N (e.g., N=100) manually when creating the dataset
  • call ds = ds.repartition(N) to force the dataset into more, smaller blocks (see the sketch after this list)
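A minimal sketch of the last two options, assuming the dataset comes from a read API such as ray.data.read_parquet (the path and N=100 are placeholders):

import ray

# Option A: ask for more blocks up front when creating the dataset.
ds = ray.data.read_parquet("s3://bucket/path", parallelism=100)

# Option B: split an existing dataset into more, smaller blocks so the
# actor pool has enough work to keep all 4 GPUs busy.
ds = ds.repartition(100)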

The repartitioning did it! I had a Ray → pandas op and a pandas → Ray op that kept it at 1 block. Thank you!
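For anyone who hits the same thing: a pandas round trip collapses the dataset to a single block, so repartitioning right after it restores the parallelism. A sketch, assuming the conversion used to_pandas()/from_pandas() (the pandas-side transform here is a placeholder):

import ray

df = test_ds_tokenized.to_pandas()       # Ray -> pandas
df = my_pandas_transform(df)             # placeholder for the pandas-side op
ds = ray.data.from_pandas(df)            # pandas -> Ray: comes back as 1 block

# Repartition before the GPU map so work spreads across the actor pool again.
ds = ds.repartition(100)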
