How do I run my experiment on a single GPU?

I’m currently relying on tune.Tuner to run my experiment on a machine that has 28 CPUs and 2 GPUs. Since I’m not the only one with access to this machine, I’d like to restrict my experiment to a single GPU.

Despite specifying with_resources(Trainer, {"cpu": 1, "gpu": 1}), both GPUs are used. The only way I’ve found to avoid this is to set os.environ["CUDA_VISIBLE_DEVICES"] = "1".
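For context, the workaround looks roughly like this (just a minimal sketch; the variable has to be set before Ray or CUDA is initialized):

import os

# Hide GPU 0 so that only GPU 1 is visible to CUDA and Ray.
# This must happen before Ray (or any CUDA library) initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import ray
ray.init()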

Is there a way to achieve my goal without explicitly setting an environment variable myself? If I understand correctly, according to the documentation this should be taken care of by tune.with_resources:

To leverage GPUs, you must set gpu in tune.with_resources(trainable, resources_per_trial). This will automatically set CUDA_VISIBLE_DEVICES for each trial.

from pathlib import Path

from ray.air import RunConfig
from ray.tune import Tuner, TuneConfig, with_resources

run_config = RunConfig(
    stop={"training_iteration": epochs},
    checkpoint_config=ck_config,
    name=f"{model_name}_{exp_details}",
    local_dir=str(Path(__file__).parent / "ray_checkpoints"),
)

tuner = Tuner(
    trainable=with_resources(Trainer, {"cpu": 1, "gpu": 1}),
    run_config=run_config,
    tune_config=TuneConfig(mode="min", metric="val_loss", num_samples=5),
    param_space=configuration,
)

Hi @mtt,

tune.with_resources sets the resources per trial. Since each of your 5 trials requests a full GPU and the machine has 2 GPUs, Tune will schedule 2 trials at a time, each taking one of the GPUs.

You can request fractional GPUs per trial and also limit concurrency so total GPU usage never exceeds a single GPU. For example, the following will run 2 trials concurrently on 1 GPU:

tuner = Tuner(
+   trainable=with_resources(Trainer, {"cpu": 1, "gpu": 0.5}),
    run_config=run_config,
    tune_config=TuneConfig(
        mode="min",
        metric="val_loss",
        num_samples=5,
+       max_concurrent_trials=2,
    ),
    param_space=configuration,
)
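
Another option, as a rough sketch (not something I’ve tested on your setup): keep full-GPU trials but cap concurrency at 1, so trials run sequentially and at most one GPU is ever in use:

tuner = Tuner(
    trainable=with_resources(Trainer, {"cpu": 1, "gpu": 1}),
    run_config=run_config,
    tune_config=TuneConfig(
        mode="min",
        metric="val_loss",
        num_samples=5,
        # Only one trial runs at a time, so at most one GPU is occupied.
        max_concurrent_trials=1,
    ),
    param_space=configuration,
)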

See Ray Tune FAQ — Ray 2.3.0 for more info.

Hi, @justinvyu
Could I ask one more question?
If the trainable is a TorchTrainer, I can’t use tune.with_resources. How can I use 0.5 GPU for one trial?
I tried setting this in the TorchTrainer:

scaling_config=ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"GPU": 0.5},
)

However, it didn’t work as expected. What should I do? Thanks in advance.

@Xinchengzelin can you create a new topic for this and elaborate a bit more on what the expected (resource) end state is?

Thanks @matthewdeng, I created a new topic: How to use fraction GPU in `ray.tune.Tuner`?