How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hi all, I'm using Ray Train, Ray Tune, and PyTorch Lightning, and I'm running into the following issue:
I have a cluster with 2 nodes, one GPU each. I can run my Tune session on it and use both GPUs with DDP just fine, but I'm having trouble running two trials concurrently, one on each node. My code:
```python
import os
from functools import partial

import ray.train
from ray import tune
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.hebo import HEBOSearch

scaling_config = ray.train.ScalingConfig(
    use_gpu=True, num_workers=2, resources_per_worker={"CPU": 16, "GPU": 1}
)

trainer = TorchTrainer(
    partial(tune_func, base_config=cfg),
    run_config=ray.train.RunConfig(
        name=cfg.experiment_name,
        storage_path=os.path.abspath(
            os.path.join(cfg.paths.ray_storage_path, cfg.model.name)
        ),
        callbacks=callbacks,
    ),
    scaling_config=scaling_config,
)

# Use HEBO search
search_space = {
    "model.wav2vec2_bundle": tune.choice(
        ["torchaudio.pipelines.WAV2VEC2_BASE", "torchaudio.pipelines.WAV2VEC2_LARGE_LV60K"]
    ),
    "model.latent_size": tune.choice([512, 768, 1024]),
    "model.layers": tune.choice([1, 2, 3]),
    "model.optimizer.weight_decay": tune.loguniform(1e-2, 1.0),
    "model.optimizer.lr": tune.loguniform(1e-8, 1e-4),
}
hebo_search = HEBOSearch(metric="val/mse_loss", mode="min")

# Stop non-promising trials early
scheduler = ASHAScheduler(max_t=1000, grace_period=500, reduction_factor=2)

tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": search_space},
    tune_config=tune.TuneConfig(
        metric="val/mse_loss",
        mode="min",
        num_samples=40,
        search_alg=hebo_search,
        scheduler=scheduler,
        max_concurrent_trials=2,
    ),
)
results = tuner.fit()
```
The Lightning `Trainer` inside my training function uses:

```python
strategy=RayDDPStrategy(),
plugins=RayLightningEnvironment(),
accelerator="auto",
devices="auto",
```
When I run it like this, Ray Train spins up two worker processes for a single trial, even though I set the GPU resources per worker to one:
```
(RayTrainWorker pid=23774, ip=10.45.48.236) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=23702, ip=10.45.48.236) Started distributed worker processes:
(TorchTrainer pid=23702, ip=10.45.48.236) - (node_id=829a7feb484cbb418e62e96f7646f0d3a9d5a0701f277e90918cb867, ip=10.45.48.236, pid=23774) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=23702, ip=10.45.48.236) - (node_id=7b8756f01d14bfecf4d7bdbbb993284baf1dde5efcd42becf8295d0e, ip=10.45.48.132, pid=20899) world_rank=1, local_rank=0, node_rank=1
(TunerInternal pid=685469) Trial status: 1 RUNNING | 1 PENDING
(TunerInternal pid=685469) Current time: 2025-02-11 10:00:22. Total running time: 2min 30s
(TunerInternal pid=685469) Logical resource usage: 33.0/66 CPUs, 2.0/2 GPUs (0.0/2.0 accelerator_type:A40, 0.0/131072.0 Memory)
(TunerInternal pid=685469) ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
(TunerInternal pid=685469) │ Trial name status ...l.wav2vec2_bundle ...model.latent_size ...nfig/model.layers ...izer.weight_decay ...odel.optimizer.lr │
(TunerInternal pid=685469) ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
(TunerInternal pid=685469) │ TorchTrainer_09e99ee5 RUNNING ...2VEC2_LARGE_LV60K 768 3 0.0215658 2.70306e-05 │
(TunerInternal pid=685469) │ TorchTrainer_95f13e6d PENDING ...nes.WAV2VEC2_BASE 512 1 0.693918 2.4991e-08 │
(TunerInternal pid=685469) ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
If I set `num_workers` in the `ScalingConfig` to 1, the Tuner does start two trials on the two different nodes, but then I get the following error:
```
RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`
```
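As the message suggests, I could presumably forward that flag through the Ray strategy wrapper, since `RayDDPStrategy` passes its keyword arguments on to Lightning's `DDPStrategy` (untested sketch, assuming that pass-through):

```python
from ray.train.lightning import RayDDPStrategy

# Forward the DDP flag suggested by the error through Ray's strategy wrapper
strategy = RayDDPStrategy(find_unused_parameters=True)
```

But that only treats the symptom; what I don't understand is why the error shows up at all when each trial should be running on a single worker.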
It almost looks like the workload is always split across both nodes, regardless of `resources_per_worker`, so the two trials interfere with each other's training sessions.
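For clarity, what I'm ultimately after is two concurrent trials, each using a single worker pinned to one GPU on its own node, i.e. roughly this scaling configuration (with `max_concurrent_trials=2` kept as in the snippet above):

```python
import ray.train

# One single-GPU worker per trial, so two trials can run side by side
scaling_config = ray.train.ScalingConfig(
    use_gpu=True,
    num_workers=1,
    resources_per_worker={"CPU": 16, "GPU": 1},
)
```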
Can anybody point me in the right direction?