GPU utilization is only 1%

Hi, I have a question.
The config is:

import random

import ray
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

config = {
    "env": "nuplan",
    "env_config": None,
    "num_workers": 30,
    # "record_env": False,
    "create_env_on_driver": False,
    "num_envs_per_worker": 1,
    "remote_worker_envs": False,
    "num_gpus": 8,
    "num_cpus_per_worker": 1,
    "num_gpus_per_worker": 0,
    "framework": "torch",
    "model": {
        "fcnet_hiddens": [512, 512, 512, 5123],
    },
    "timesteps_per_iteration": 200,
    # "sample_async": True,
    "horizon": 600,
    "rollout_fragment_length": 4,  # 4 * 30 = 120 samples per sampling round
    "train_batch_size": 24,
    "replay_buffer_config": {
        "_enable_replay_buffer_api": True,
        "type": "MultiAgentReplayBuffer",
        "learning_starts": 10,
        "capacity": 50000,
        "replay_sequence_length": 1,
    },
    # "training_intensity": 10,  # train/collect ratio
    "batch_mode": "truncate_episodes",  # can also be set to "complete_episodes"
}
    
pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    perturbation_interval=7200,
    resample_probability=0.25,
    hyperparam_mutations={
        "lr": lambda: random.uniform(1e-3, 5e-5),
        "gamma": lambda: random.uniform(0.90, 0.99),
    },
)

and I use it like this:

if __name__ == "__main__":
    ray.init(num_gpus=8)
    tune.run(
        "DQN",
        config=config,
        scheduler=pbt,
        num_samples=1,
        metric="episode_reward_mean",
        mode="max",
        local_dir="./results",
    )

But when I look at the GPU utilization, it is only 1% on one GPU and sometimes it stays at 0. Please help me! Thank you!
The version is 2.0.0, torch==1.9.0.

Can you help me?

Can somebody help me…

Please make this a reproducible script.

@wangjunhe8127 Hello, GPU utilization really depends on your workload. You are running DQN with a train_batch_size of 24, which is pretty small, and your network is just an FCNet with 4 layers, so I don’t expect the utilization to be very high anyway. Also, you should set num_gpus to 1 unless you are doing multi-GPU training per Tune trial, which I don’t think is the case here. Increasing train_batch_size should result in higher GPU utilization during training. Sampling is done on the CPU workers, so utilization will drop while sampling. I hope it helps.
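For example, a minimal sketch of that adjustment (the numbers are illustrative assumptions, not values tuned for your environment): keep sampling on the CPU workers, give the single learner one GPU, and raise the train batch so each SGD step actually has work to do.

# Illustrative adjustment of the config above (values are assumptions):
config.update({
    "num_gpus": 1,             # one GPU for the learner of this trial
    "num_gpus_per_worker": 0,  # rollout workers keep sampling on CPU
    "train_batch_size": 512,   # a larger batch keeps the GPU busy during SGD
})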

Thank you very much for your reply. However, I actually do want to use multi-GPU training, so why do you suggest num_gpus=1?

import random

import ray
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

config = {
    "env": "xxx",
    "env_config": None,
    "num_workers": 7,
    "create_env_on_driver": False,
    "num_envs_per_worker": 1,
    "remote_worker_envs": False,
    "num_gpus": 3,
    "num_cpus_per_worker": 1,
    "num_gpus_per_worker": 0,
    "framework": "torch",
    "learning_starts": 20,
    "placement_strategy": "SPREAD",
    "model": {
        "fcnet_hiddens": [512, 512, 512, 5123],
    },
    "timesteps_per_iteration": 200,
    "horizon": 600,
    "rollout_fragment_length": 4,  # 4 * 7 = 28 samples per sampling round
    "train_batch_size": 24,
    "replay_buffer_config": {
        "type": "MultiAgentReplayBuffer",
        "capacity": 50000,
    },
    "batch_mode": "truncate_episodes",
}
    
pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    perturbation_interval=7200,
    resample_probability=0.25,
    hyperparam_mutations={
        "lr": lambda: random.uniform(1e-3, 5e-5),
        "gamma": lambda: random.uniform(0.90, 0.99),
    },
)

if __name__ == "__main__":
    ray.init()
    tune.run(
        "DQN",
        config=config,
        num_samples=4,
        metric="episode_reward_mean",
        mode="max",
        local_dir="./results",
    )
    ray.shutdown()

Thank you! The above is the complete setup except for the environment, and the version of Ray is 1.10.0.

Got it. Before I take an in-depth look at this, is there a particular reason you are using 1.10? From the content of the repro script it looks like you could just as well use 2.0. Have you tried that?

Thanks! Because our ws2 tool only supports v1.10.0. Do you mean it may be a version problem?

RLlib has a lot of moving parts and many things have changed since 1.10.0, but I can’t think of a particular part that would cause this behavior.

You are using a very small batch size, as @kourosh said.

Here’s a more extensive explanation:
The memory usage of your GPUs looks just fine for what you are doing. But RLlib splits the training batch (which is already extremely small) across your GPUs, so every GPU computes its SGD step on only 3 samples (train_batch_size=24, num_gpus=8). For every iteration of the algorithm, you sample (which takes time) and then push very little data through those 8 GPUs - hence the tiny utilization.
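As a rough sketch of that arithmetic with the numbers from the config above, plus one illustrative (not tuned) way to size the batch per GPU:

# Per-GPU share of the training batch under the posted config:
train_batch_size = 24
num_gpus = 8
print(train_batch_size // num_gpus)  # -> 3 samples per GPU per SGD step

# Illustrative fix (values are assumptions, not recommendations):
# scale the train batch with the number of learner GPUs, or just use one GPU.
per_gpu_batch = 256
config["train_batch_size"] = per_gpu_batch * config["num_gpus"]  # 2048 with 8 GPUs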

OK, thank you!