Multi-GPU Ray + accelerate HPO

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.50.0
  • Python version: 3.10
  • OS: Linux
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant): accelerate

3. What happened vs. what you expected:

  • Expected: The Accelerator from accelerate uses all GPUs provided to the worker loop function, since they are made visible via CUDA_VISIBLE_DEVICES by the actor according to resources_per_worker.
  • Actual: The Accelerator uses a single GPU even though several GPUs are visible inside the worker loop function.

Let’s say I have 4 GPUs and I specify in resources_per_worker that each worker can use 2 GPUs. When I start my HPO, Ray shows that each actor uses 2 GPUs, but in reality each actor uses only 1 GPU. I checked that CUDA_VISIBLE_DEVICES inside the worker loop function lists the expected GPUs, and it does, but right after I create the Accelerator it has num_processes=1 and correspondingly uses a single GPU. I tried using accelerate launch instead, but then the worker loop function never starts at all.

I have code that looks something like this:

import os
from typing import Any

import ray
from accelerate import Accelerator
from ray.train import RunConfig
from ray.train.torch import TorchTrainer
from ray.tune import TuneConfig, Tuner, choice
from ray.tune.search.optuna import OptunaSearch


def hpo_loop(config: dict[str, Any]) -> None:
    # gives 0,1,2
    print(os.getenv("CUDA_VISIBLE_DEVICES"))

    accelerator = Accelerator(mixed_precision="bf16", cpu=False)

    # gives 1
    print(accelerator.num_processes)

    # gives MULTI_GPU, so I don't think this is the reason
    print(accelerator.distributed_type)


if __name__ == "__main__":
    resources_per_worker = {"CPU": 4, "GPU": 2}

    trainer = TorchTrainer(
        train_loop_per_worker=hpo_loop,
        scaling_config=ray.train.ScalingConfig(
            resources_per_worker=resources_per_worker, use_gpu=True
        ),
    )

    search_alg = OptunaSearch(seed=42)

    tuner = Tuner(
        trainable=trainer,
        tune_config=TuneConfig(
            metric="loss",
            mode="min",
            search_alg=search_alg,
            num_samples=4,
        ),
        run_config=RunConfig(stop={"training_iteration": 1}),
        param_space={"train_loop_config": {"lr": choice([0.01, 0.001])}},
    )
    tuner.fit()

I run it with python3, something like python3 hpo.py.

I use Hydra for configuration, but I don’t think that’s the problem; all config params are parsed correctly in each worker.

Can you explain what I am doing wrong and how to fix it?

It seems likely that when using Ray’s TorchTrainer with resources_per_worker={"GPU": 2}, Ray correctly sets CUDA_VISIBLE_DEVICES to expose two GPUs per worker, but accelerate.Accelerator still runs as a single process (and thus uses a single GPU) by default. accelerate expects to be launched in a distributed, multi-process context (e.g., via accelerate launch or torchrun), whereas Ray’s TorchTrainer runs train_loop_per_worker in a single process per worker, even if multiple GPUs are visible to that process. As a result, accelerator.num_processes remains 1 and only one GPU is used per worker, despite two being listed in CUDA_VISIBLE_DEVICES. This is consistent with Ray’s documented approach for multiple GPUs per worker: if you want a single worker process to use more than one GPU, you must handle device assignment and parallelism inside the worker function yourself. The accelerate library does not automatically spawn extra processes per worker in this context, so it will not use all visible GPUs unless you add that logic yourself or use a different launch strategy (Ray Train: Using GPUs, Ray Train: ScalingConfig).
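As a quick illustration of that point, here is a small diagnostic you could drop into hpo_loop (a sketch; inspect_accelerate_env is just a hypothetical helper name). Accelerate sizes num_processes from the distributed launch environment (the torch.distributed world size), not from how many devices CUDA_VISIBLE_DEVICES exposes, which is why two GPUs can be visible while num_processes stays at 1:

import os

import torch
import torch.distributed as dist
from accelerate import Accelerator


def inspect_accelerate_env() -> None:
    # What Ray exposes to this worker process.
    print("CUDA_VISIBLE_DEVICES:", os.getenv("CUDA_VISIBLE_DEVICES"))
    print("torch.cuda.device_count():", torch.cuda.device_count())

    # What Accelerate actually keys off: the world size of the distributed
    # launch environment, not the number of visible devices.
    if dist.is_available() and dist.is_initialized():
        print("torch.distributed world size:", dist.get_world_size())

    accelerator = Accelerator()
    print("accelerator.num_processes:", accelerator.num_processes)
    print("accelerator.distributed_type:", accelerator.distributed_type)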

My understanding is that this is a limitation of how Ray integrates with accelerate: Ray’s TorchTrainer does not natively perform the per-worker multi-process launch that accelerate expects for multi-GPU usage. If you want to use all of a worker’s GPUs with accelerate, you may need to spawn multiple processes within the worker yourself, or use a distributed launch utility outside of Ray, which is not directly supported by Ray’s TorchTrainer API. There is no indication in the data sources that Ray’s TorchTrainer will automatically coordinate accelerate’s multi-GPU, multi-process usage within a single worker (Ray Train: Using GPUs, Ray GitHub: test_data_parallel_trainer.py).
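If the goal is simply for each trial to train on two GPUs with Accelerate, one commonly documented pattern is to invert the scaling: give each Ray Train worker one GPU and ask for two workers per trial. TorchTrainer then starts one process per GPU and sets up the torch.distributed process group, which Accelerator should detect, so num_processes would come out as 2 inside the training function. A minimal sketch of that ScalingConfig, assuming the rest of your script stays as posted:

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Sketch: 2 workers x 1 GPU per trial instead of 1 worker x 2 GPUs.
# Each Tune trial then occupies 2 GPUs in total, so with 4 GPUs two of the
# num_samples=4 trials can run concurrently.
trainer = TorchTrainer(
    train_loop_per_worker=hpo_loop,  # same function as in the repro above
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"CPU": 4, "GPU": 1},
    ),
)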

Would you like more detail on possible workarounds or example code for multi-GPU usage per worker with Ray and accelerate?
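For completeness, if you do want a single worker process to drive both of its visible GPUs without any multi-process launch, a minimal Accelerate-free sketch is plain torch.nn.DataParallel; the helper name below is hypothetical, and with this approach accelerator.num_processes staying at 1 would be expected:

import torch
from torch import nn


def build_model_for_worker(model: nn.Module) -> nn.Module:
    # Split each batch across every GPU that Ray made visible to this worker,
    # all from a single process; no multi-process launcher is involved.
    model = model.cuda()
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model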

Hint: Mention @RunLLM in the post for followups.