- Medium: It contributes significant difficulty to completing my task, but I can work around it.
I have recently been training a custom environment locally for testing, but I have since scaled it up to a VM with an NVIDIA T4 GPU.
My code to set up training looks like this:
import multiprocessing
import os

import ray
import torch
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

# SatelliteSensorEnv and custom_eval_function are defined elsewhere in my project.

if __name__ == "__main__":
    available_cpus = multiprocessing.cpu_count()
    available_gpus = torch.cuda.device_count()

    # Build one sample env just to read the per-agent observation/action spaces.
    sample_env = SatelliteSensorEnv()

    config = (
        PPOConfig()
        .environment(SatelliteSensorEnv)
        .multi_agent(
            policies={
                "satellite": (
                    None,
                    sample_env.observation_space["satellite"],
                    sample_env.action_space["satellite"],
                    PPOConfig.overrides(framework_str="tf"),
                ),
                "sensor": (
                    None,
                    sample_env.observation_space["sensor"],
                    sample_env.action_space["sensor"],
                    PPOConfig.overrides(framework_str="tf"),
                ),
            },
            policy_mapping_fn=(lambda agent_id, *args, **kwargs: agent_id),
        )
        .env_runners(
            batch_mode="complete_episodes",
            num_env_runners=available_cpus - 1,
            num_cpus_per_env_runner=1,
        )
        .resources(num_gpus=available_gpus)
        .learners(num_learners=1, num_gpus_per_learner=1)
        .evaluation(
            evaluation_interval=10,
            evaluation_duration=10,
            evaluation_duration_unit="episodes",
            custom_evaluation_function=custom_eval_function,
        )
    )

    tuner = tune.Tuner(
        "PPO",
        param_space=config.to_dict(),
        run_config=train.RunConfig(
            stop={"training_iteration": 1000},
            checkpoint_config=train.CheckpointConfig(
                num_to_keep=5,
                checkpoint_score_attribute="episode_reward_mean",
                checkpoint_score_order="max",
                checkpoint_frequency=10,
                checkpoint_at_end=True,
            ),
            storage_path=f"{os.path.dirname(os.path.abspath(__file__))}/Results",
        ),
    )
    results = tuner.fit()
    ray.shutdown()
I’ve tried running this with both PyTorch and TF as the framework, but I run into the same issue below.
Trial status: 1 RUNNING
Current time: 2024-07-31 16:27:18. Total running time: 3min 0s
Logical resource usage: 16.0/16 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:T4)
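For context, switching frameworks between runs amounts to flipping the framework setting, roughly as sketched below (a minimal sketch rather than my actual script; .framework() is the standard AlgorithmConfig method, and the per-policy framework_str overrides shown above get flipped to match):

# Sketch of the framework toggle between runs (illustrative only):
config = (
    PPOConfig()
    .environment(SatelliteSensorEnv)
    .framework("torch")  # or .framework("tf2") for the TensorFlow runs
)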
For whatever reason, I can’t seem to get GPU acceleration to kick in, so training is about as fast on the CPU as it is on the GPU. The output of an nvidia-smi call shows how little of the GPU is actually being used:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 529.19       Driver Version: 529.19       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            TCC  | 00000001:00:00.0 Off |                    0 |
| N/A   48C    P0    29W /  70W |    801MiB / 15360MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17776      C   ....conda\envs\rl\python.exe      800MiB |
+-----------------------------------------------------------------------------+
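In case it helps with debugging, below is the kind of standalone check I can run on the VM to see whether a GPU-requesting Ray task actually gets the T4 (a minimal sketch, separate from my training code; the gpu_probe name is made up, while ray.get_gpu_ids(), the num_gpus=1 request, and the torch.cuda calls are standard APIs):

import ray
import torch

@ray.remote(num_gpus=1)
def gpu_probe():
    # Report what a Ray task that requests one GPU actually sees.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "cuda_available": torch.cuda.is_available(),
        "device_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }

if __name__ == "__main__":
    ray.init()
    print(ray.get(gpu_probe.remote()))
    ray.shutdown()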
I’m hoping someone might have a fix for this so that I can use the GPU effectively. Thanks!