How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I have 4 GPUs and am trying to do distributed tuning where each tuning trial uses 2 GPUs for distributed training (i.e., 2 tuning trials running in parallel, using all 4 GPUs). However, when I run a simple CIFAR-10 example, it only ever uses 2 of the 4 GPUs and appears to run only one tuning trial at a time. I think I may be misunderstanding how to allocate `num_workers` and `num_gpus_per_worker`.
```python
from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator

# Wrap the training function so each trial launches 2 distributed workers.
trainable_cls = DistributedTrainableCreator(
    train_cifar,
    num_workers=2,
    num_gpus_per_worker=2,
    num_cpus_per_worker=8)

analysis = tune.run(
    trainable_cls,
    config=config,
    num_samples=4,
    stop={"training_iteration": 10},
    metric="accuracy",
    mode="max")
```
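If I'm reading the API correctly, the settings above reserve `num_workers * num_gpus_per_worker = 4` GPUs per trial, which would at least explain why only one trial runs at a time. Below is a sketch of what I thought should give two concurrent 2-GPU trials (same `train_cifar` and `config` as above); this reflects my interpretation of the parameters, which may be wrong:

```python
# Sketch of my intended setup -- assuming num_gpus_per_worker is the GPU
# count per worker process, so each trial reserves 2 workers x 1 GPU = 2 GPUs
# and two trials can run concurrently on 4 GPUs.
trainable_cls = DistributedTrainableCreator(
    train_cifar,
    num_workers=2,          # 2 distributed training workers per trial
    num_gpus_per_worker=1,  # 1 GPU per worker -> 2 GPUs per trial
    num_cpus_per_worker=8)

analysis = tune.run(
    trainable_cls,
    config=config,
    num_samples=4,
    stop={"training_iteration": 10},
    metric="accuracy",
    mode="max")
```

With these settings I'd expect all 4 GPUs to be busy, but I'm not sure which interpretation of `num_gpus_per_worker` is correct.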