Ray Train/Tune issue: concurrent trials conflict on GPU nodes

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi all, I'm using Ray Train, Tune, and PyTorch Lightning and running into the following issue:
I have a cluster with 2 nodes, one GPU each. I can run my Tune session on it and use both GPUs with DDP just fine, but I'm having trouble running two trials concurrently, one on each node. My code:

from functools import partial
import os

import ray
from ray import tune
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.hebo import HEBOSearch

scaling_config = ray.train.ScalingConfig(use_gpu=True, num_workers=2, resources_per_worker={"CPU": 16, "GPU": 1})
trainer = TorchTrainer(
    partial(tune_func, base_config=cfg),
    run_config=ray.train.RunConfig(
        name=cfg.experiment_name,
        storage_path=os.path.abspath(os.path.join(cfg.paths.ray_storage_path, cfg.model.name)),
        callbacks=callbacks,
    ),
    scaling_config=scaling_config,
)

# Use HEBO search
search_space = {
    "model.wav2vec2_bundle": tune.choice(["torchaudio.pipelines.WAV2VEC2_BASE", "torchaudio.pipelines.WAV2VEC2_LARGE_LV60K"]),
    "model.latent_size": tune.choice([512, 768, 1024]),
    "model.layers": tune.choice([1, 2, 3]),
    "model.optimizer.weight_decay": tune.loguniform(1e-2, 1.0),
    "model.optimizer.lr": tune.loguniform(1e-8, 1e-4),
}
hebo_search = HEBOSearch(metric="val/mse_loss", mode="min")

# Stop non-promising trials early
scheduler = ASHAScheduler(max_t=1000, grace_period=500, reduction_factor=2)

tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": search_space},
    tune_config=tune.TuneConfig(
        metric="val/mse_loss",
        mode="min",
        num_samples=40,
        search_alg=hebo_search,
        scheduler=scheduler,
        max_concurrent_trials=2
    ),
)

results = tuner.fit()

My Lightning Trainer (inside the train function) uses:

strategy=RayDDPStrategy(),
plugins=[RayLightningEnvironment()],
accelerator="auto",
devices="auto",

When I run it like this, Ray Train spins up two worker processes for a single trial, even though I set the GPU resources per worker to one:

(RayTrainWorker pid=23774, ip=10.45.48.236) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=23702, ip=10.45.48.236) Started distributed worker processes: 
(TorchTrainer pid=23702, ip=10.45.48.236) - (node_id=829a7feb484cbb418e62e96f7646f0d3a9d5a0701f277e90918cb867, ip=10.45.48.236, pid=23774) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=23702, ip=10.45.48.236) - (node_id=7b8756f01d14bfecf4d7bdbbb993284baf1dde5efcd42becf8295d0e, ip=10.45.48.132, pid=20899) world_rank=1, local_rank=0, node_rank=1
(TunerInternal pid=685469) Trial status: 1 RUNNING | 1 PENDING
(TunerInternal pid=685469) Current time: 2025-02-11 10:00:22. Total running time: 2min 30s
(TunerInternal pid=685469) Logical resource usage: 33.0/66 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:A40, 0.0/131072.0 Memory)
(TunerInternal pid=685469) ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
(TunerInternal pid=685469) │ Trial name              status     ...l.wav2vec2_bundle       ...model.latent_size     ...nfig/model.layers     ...izer.weight_decay     ...odel.optimizer.lr │
(TunerInternal pid=685469) ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
(TunerInternal pid=685469) │ TorchTrainer_09e99ee5   RUNNING    ...2VEC2_LARGE_LV60K                        768                        3                0.0215658              2.70306e-05 │
(TunerInternal pid=685469) │ TorchTrainer_95f13e6d   PENDING    ...nes.WAV2VEC2_BASE                        512                        1                0.693918               2.4991e-08  │
(TunerInternal pid=685469) ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

If I set num_workers in the ScalingConfig to 1, I do see the tuner starting two trials on the two different nodes, but then I get the following error:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)

It almost looks like the workload is always divided across both nodes, regardless of resources_per_worker, so the trials interfere with each other's training sessions.
Can anybody point me in the right direction?

Hi Julian! Thanks for your question! I did some research in the docs and hopefully this will help.

So, first off, the way your resources are allocated might be a bit off. Your current setup uses two workers per trial, which spreads a single trial across both nodes and leaves no room for a second one. If you want each trial to run on a single node with one GPU, try setting num_workers=1. You'll then need to adjust your strategy to handle unused parameters, which PyTorch DDP apparently flags once you switch to one worker; passing find_unused_parameters=True through the strategy in your train function should help, as sketched below.
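
RayDDPStrategy subclasses Lightning's DDPStrategy, so (if I'm reading the integration right) you should be able to pass the flag straight through it inside your train function:

from ray.train.lightning import RayDDPStrategy

# Forwarded to Lightning's DDPStrategy, so this should behave like
# DDPStrategy(find_unused_parameters=True) while keeping the Ray integration.
strategy = RayDDPStrategy(find_unused_parameters=True)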

Also, make sure your cluster has enough free resources to run two trials at once. Your max_concurrent_trials=2 looks right for this, but double-check that each node actually has the CPUs and GPU a trial requests. How many do you have available per node?
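
To double-check what the cluster actually exposes, a quick diagnostic from any node:

import ray

ray.init(address="auto")  # attach to the running cluster
# Total resources Ray knows about, and what is currently unclaimed;
# each of your trials needs 16 free CPUs and 1 free GPU to get scheduled.
print(ray.cluster_resources())
print(ray.available_resources())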

You might also want to look into placement groups; they help allocate resources more predictably and avoid unwanted competition between trials. Check that your cluster configuration lines up with what each trial requests, since each trial should ideally fit on a single node, especially with the GPUs. 🙂
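
For what it's worth, the placement group each TorchTrainer trial requests is derived from its ScalingConfig, so something along these lines (a sketch, assuming each trial should fit entirely on one node) keeps one trial per node:

import ray

# Each trial requests a placement group built from this config. With a single
# worker holding 16 CPUs and 1 GPU, the whole trial fits on one of your nodes,
# leaving the other node free for the second trial.
scaling_config = ray.train.ScalingConfig(
    num_workers=1,
    use_gpu=True,
    resources_per_worker={"CPU": 16, "GPU": 1},
    placement_strategy="PACK",  # the default; keeps a trial's bundles together
)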

As for the LightningModule error: I think it comes from unused parameters in your model rather than from resource allocation specifically.

I managed to get it working by removing the Ray TorchTrainer and just using Tune directly:

tuner = tune.Tuner(
    tune.with_resources(partial(tune_func, base_config=cfg), {"cpu": 16, "gpu": 1}),
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="val/mse_loss",
        mode="min",
        num_samples=40,
        search_alg=hebo_search,
        scheduler=scheduler,
        max_concurrent_trials=2,
    ),
    run_config=ray.train.RunConfig(
        name=cfg.experiment_name,
        storage_path=os.path.abspath(os.path.join(cfg.paths.ray_storage_path, cfg.model.name)),
    ),
)

I also removed RayDDPStrategy and RayLightningEnvironment from the Lightning trainer inside the train function.
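
Roughly, tune_func now just builds a plain single-device Lightning Trainer; the callback below is only a minimal example of how the validation metric can get reported back to Tune each epoch so ASHA can still stop trials:

import lightning.pytorch as pl
from ray import train


class ReportValLossCallback(pl.Callback):
    # Minimal example callback: forwards val/mse_loss to Tune after every
    # validation epoch so the ASHA scheduler can stop unpromising trials.
    def on_validation_epoch_end(self, trainer, pl_module):
        metrics = trainer.callback_metrics
        if "val/mse_loss" in metrics:
            train.report({"val/mse_loss": metrics["val/mse_loss"].item()})


def tune_func(config, base_config=None):
    # Merge the sampled hyperparameters into the base config (omitted).
    model = ...  # build the LightningModule

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,  # one GPU per trial, matching tune.with_resources
        callbacks=[ReportValLossCallback()],
        enable_progress_bar=False,
    )
    trainer.fit(model)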

It's kind of weird: the docs at Using PyTorch Lightning with Tune — Ray 2.42.1 mention using multiple workers while limiting the resources per worker, as I did. Setting find_unused_parameters adds extra overhead and shouldn't be required for this purpose.