How do I run my experiment on a single GPU?

I’m currently relying on tune.Tuner to run my experiment on a machine that has 28 CPUs and 2 GPUs. Since I’m not the only one with access to this machine, I’d like to restrict my experiment to a single GPU.

Despite specifying with_resources(Trainer, {"cpu": 1, "gpu": 1}), both GPUs are used. The only way I’ve found to avoid this is to set os.environ["CUDA_VISIBLE_DEVICES"] = "1".
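For context, the workaround looks roughly like this (just a minimal sketch; the variable has to be set before Ray or CUDA is initialized):

import os

# Hide GPU 0 so that only GPU 1 is visible to CUDA and Ray.
# This must happen before Ray (or any CUDA library) initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import ray
ray.init()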

Is there a way to achieve my goal without explicitly setting an environment variable myself? If I understand correctly, according to the documentation this should be taken care of by tune.with_resources:

To leverage GPUs, you must set gpu in tune.with_resources(trainable, resources_per_trial). This will automatically set CUDA_VISIBLE_DEVICES for each trial.

from pathlib import Path

from ray.air import RunConfig
from ray.tune import Tuner, TuneConfig, with_resources

run_config = RunConfig(
    stop={"training_iteration": epochs},
    checkpoint_config=ck_config,
    name=f"{model_name}_{exp_details}",
    local_dir=str(Path(__file__).parent / "ray_checkpoints"),
)

tuner = Tuner(
    trainable=with_resources(Trainer, {"cpu": 1, "gpu": 1}),
    run_config=run_config,
    tune_config=TuneConfig(mode="min", metric="val_loss", num_samples=5),
    param_space=configuration,
)

Hi @mtt,

tune.with_resources sets the resources per trial. Since each of your 5 trials requests a full GPU and the machine has 2 GPUs, Tune will schedule 2 trials at a time, each taking one of the GPUs.

You can request fractional GPUs per trial and also limit concurrency so total GPU usage never exceeds a single GPU. For example, the following will run 2 trials concurrently on 1 GPU:

tuner = Tuner(
+   trainable=with_resources(Trainer, {"cpu": 1, "gpu": 0.5}),
    run_config=run_config,
    tune_config=TuneConfig(
        mode="min",
        metric="val_loss",
        num_samples=5,
+       max_concurrent_trials=2,
    ),
    param_space=configuration,
)
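
Another option, as a rough sketch (not something I’ve tested on your setup): keep full-GPU trials but cap concurrency at 1, so trials run sequentially and at most one GPU is ever in use:

tuner = Tuner(
    trainable=with_resources(Trainer, {"cpu": 1, "gpu": 1}),
    run_config=run_config,
    tune_config=TuneConfig(
        mode="min",
        metric="val_loss",
        num_samples=5,
        # Only one trial runs at a time, so at most one GPU is occupied.
        max_concurrent_trials=1,
    ),
    param_space=configuration,
)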

See Ray Tune FAQ — Ray 2.3.0 for more info.

Hi, @justinvyu
Could I ask one more question?
If the trainable is a TorchTrainer, I can’t use tune.with_resources. How can I use 0.5 GPU for one trial?
I tried setting this in the TorchTrainer:

scaling_config=ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"GPU": 0.5},
)

However, it didn’t work as expected. What should I do? Thanks in advance.

@Xinchengzelin can you create a new topic for this and elaborate a bit more on what the expected (resource) end state is?

Thanks @matthewdeng, I created a new topic: How to use fraction GPU in `ray.tune.Tuner`?