GPU Acceleration 0.0/1.0

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have recently been training a custom environment locally for testing, and have since moved it to a VM with an NVIDIA T4 GPU.
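As a quick sanity check (this is just a throwaway snippet, not part of the training script), PyTorch does see the card on the VM:

import torch

# Confirm the VM exposes the T4 to PyTorch.
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # expected: Tesla T4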

My code to set up training looks like this:

import multiprocessing
import os

import ray
import torch
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

# SatelliteSensorEnv and custom_eval_function are defined earlier in this file.

if __name__ == "__main__":

    # Detect the resources available on the VM.
    available_cpus = multiprocessing.cpu_count()
    available_gpus = torch.cuda.device_count()

    sample_env = SatelliteSensorEnv()

    config = (
        PPOConfig()
        .environment(SatelliteSensorEnv)
        .multi_agent(
            policies={
                "satellite": (
                    None,
                    sample_env.observation_space["satellite"],
                    sample_env.action_space["satellite"],
                    PPOConfig.overrides(framework_str="tf"),
                ),
                "sensor": (
                    None,
                    sample_env.observation_space["sensor"],
                    sample_env.action_space["sensor"],
                    PPOConfig.overrides(framework_str="tf"),
                ),
            },
            # Each agent id maps directly to the policy of the same name.
            policy_mapping_fn=(lambda agent_id, *args, **kwargs: agent_id),
        )
        .env_runners(
            batch_mode="complete_episodes",
            # Leave one CPU free for the driver/learner.
            num_env_runners=available_cpus - 1,
            num_cpus_per_env_runner=1,
        )
        .resources(num_gpus=available_gpus)
        # Single learner, pinned to the one GPU.
        .learners(num_learners=1, num_gpus_per_learner=1)
        .evaluation(
            evaluation_interval=10,
            evaluation_duration=10,
            evaluation_duration_unit="episodes",
            custom_evaluation_function=custom_eval_function,
        )
    )

    tuner = tune.Tuner(
        "PPO",
        param_space=config.to_dict(),
        run_config=train.RunConfig(
            stop={"training_iteration": 1000},
            checkpoint_config=train.CheckpointConfig(
                num_to_keep=5,
                checkpoint_score_attribute="episode_reward_mean",
                checkpoint_score_order="max",
                checkpoint_frequency=10,
                checkpoint_at_end=True,
            ),
            storage_path=f"{os.path.dirname(os.path.abspath(__file__))}/Results",
        ),
    )

    results = tuner.fit()

    ray.shutdown()

I’ve tried running this with both PyTorch and TF as the framework, but I run into the same issue either way.
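For the PyTorch runs the only change is the framework setting, roughly as below (with the per-policy framework_str overrides adjusted to match):

# Switch the whole algorithm to PyTorch instead of TF.
config = config.framework("torch")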

Trial status: 1 RUNNING
Current time: 2024-07-31 16:27:18. Total running time: 3min 0s
Logical resource usage: 16.0/16 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:T4)
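So Ray does claim the GPU logically, and it is registered with the cluster. This standalone check (again, not part of the training script) confirms it:

import ray

# Confirm Ray registers the T4 on this VM.
ray.init()
print(ray.cluster_resources())
# Prints something like {'CPU': 16.0, 'GPU': 1.0, 'accelerator_type:T4': 1.0, ...}
ray.shutdown()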

For whatever reason, I can’t seem to get GPU acceleration to actually kick in, so training is about as fast on the CPU as it is with the GPU. The nvidia-smi output below shows how little of the GPU is actually being used.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 529.19       Driver Version: 529.19       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4           TCC   | 00000001:00:00.0 Off |                    0 |
| N/A   48C    P0    29W /  70W |    801MiB / 15360MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17776      C   ....conda\envs\rl\python.exe      800MiB |
+-----------------------------------------------------------------------------+
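In case it helps, this is roughly how I watched utilization over time while training ran (torch.cuda.utilization() relies on pynvml being installed and a reasonably recent PyTorch; the loop is only an illustrative polling sketch):

import time
import torch

# Poll GPU utilization every few seconds while training runs in another process.
for _ in range(60):
    print(f"GPU util: {torch.cuda.utilization()}%")
    time.sleep(5)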

I’m hoping someone might have a fix for this so that I can use the GPU effectively. Thanks!