GPU acceleration cannot be used with Ray and Tune when training PPO

I set up Ray Tune like this:

results = tune.Tuner(

but the log shows that the GPU is not being used. Why?
Logical resource usage: 36.0/104 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:RTX)
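For reference, the snippet above is truncated; a minimal Tune + PPO setup along these lines would look roughly like the sketch below (the environment and worker counts are illustrative assumptions, not the poster's actual values):

```python
# Sketch of a typical Tune + PPO setup (illustrative; the original
# snippet in this post is truncated, so all arguments are assumptions).
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")      # assumed env for illustration
    .rollouts(num_rollout_workers=4) # assumed worker count
)

results = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
).fit()
```

Without any GPU settings in the config, a run like this will show `0/1 GPUs` in the logical resource usage, as in the log above.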

Hi @Keik1999, it is hard to give an opinion without a reproducible example. Could you provide one?

Thanks for your kind help. I figured out this problem today with my colleagues in the research group. I found that Ray allocates the entire GPU to a single trial by default if we use this code:

But if I use Tune to run trials with different parameters, this setting has no effect and Ray runs the trials in parallel on CPUs only.

So I need to use:

Then everything is OK:

Logical resource usage: 32.0/256 CPUs, 0.8000000000000002/1 GPUs
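The fix the poster describes (the actual snippet is missing from the post) is most likely fractional GPU allocation per trial, so that concurrent Tune trials can share the single GPU. A sketch, assuming 4 concurrent trials with 0.2 GPU each, which matches the `0.8/1 GPUs` in the log above:

```python
# Sketch of fractional GPU sharing across Tune trials (an assumption about
# what the missing snippet did; the numbers are illustrative).
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")  # assumed env for illustration
    .resources(num_gpus=0.2)     # each trial's trainer gets 0.2 of the GPU
)

results = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(num_samples=4),  # 4 trials -> 0.8 GPU total
).fit()
```

With whole-GPU allocation (`num_gpus=1`) only one trial can hold the GPU at a time; fractional values let Tune schedule several GPU-using trials concurrently.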

By the way, I use ray==2.8.


Great that you found the solution yourself. Note that num_gpus_per_worker is usually only needed if inference is expensive or your environment itself can use GPUs. Most effective is usually num_gpus_per_learner_worker if _enable_new_api_stack=True. If the latter is False, num_gpus defines the number of GPUs for the local worker, where training usually happens.