Different hardware usage of rollout-workers during sampling on cluster

Blubberblub · March 2, 2023, 3:25pm

I’m training some PPO policies with a custom env on a small cluster with 1 head node and 3 nodes for rollout workers. For some reason, one of my 3 rollout worker nodes doesn’t seem to use the GPU during sampling and shows increased CPU usage (see the attached screenshot). My env uses rendering software that can run on the GPU, the CPU, or both, so it seems that Ray somehow prevents the env from using the GPU and that node falls back to rendering on the CPU instead.
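One way to narrow this down is to log which GPUs each rollout worker process is allowed to see. The sketch below relies on the fact that Ray restricts GPU visibility for its workers through the CUDA_VISIBLE_DEVICES environment variable. Whether the rendering software honors that variable is an assumption, and the helper names visible_gpu_ids and describe_gpu_visibility are hypothetical, not part of Ray or RLlib.

```python
import os
import socket


def visible_gpu_ids(environ=None):
    """Return the GPU ids this process may use per CUDA_VISIBLE_DEVICES.

    Returns None if the variable is unset (no restriction applied) and an
    empty list if it is set but empty (all GPUs hidden from the process).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


def describe_gpu_visibility(environ=None):
    """Build a one-line report that can be printed from the env constructor."""
    ids = visible_gpu_ids(environ)
    host = socket.gethostname()
    if ids is None:
        return f"{host}: CUDA_VISIBLE_DEVICES is unset"
    if not ids:
        return f"{host}: CUDA_VISIBLE_DEVICES is empty, no GPU visible"
    return f"{host}: visible GPUs = {','.join(ids)}"
```

Printing describe_gpu_visibility() from the custom env’s __init__ should show, in the worker logs, whether the node that renders on the CPU received an empty GPU assignment.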

My config for resources and rollouts looks like this:

from ray.rllib.algorithms.ppo import PPOConfig

train_config = PPOConfig()\
    .resources(
        num_gpus=0.5,
        num_gpus_per_learner_worker=1.0,
        num_gpus_per_worker=1.0,
        num_cpus_per_worker=8,
        placement_strategy="SPREAD"
    )\
    .rollouts(
        num_rollout_workers=3,
        num_envs_per_worker=1,
    )

Is my config wrong? I’d be happy if someone could help explain what could be wrong here!

Edit: Corrected the screenshot showing the situation

Blubberblub · March 6, 2023, 9:13am

I found a somewhat “working” solution with this config:

train_config = PPOConfig()\
    .resources(
        num_gpus=0.1,
        num_gpus_per_learner_worker=0.8,
        num_gpus_per_worker=1.0,
        num_cpus_per_worker=16,
    )\
    .rollouts(
        num_rollout_workers=2,
        num_envs_per_worker=1,
    )

It seems that when I don’t specify num_gpus and num_gpus_per_learner_worker, PPO.train does not use any GPU. However, if I set num_gpus to 1.0, PPO.train is moved from the head node to another node (which I don’t want). Is there any documentation that explains how Ray places workloads depending on the arguments in rollouts and resources?
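To compare the two configs, it can help to add up what each one asks the cluster for. The following is only a rough sketch: it assumes the driver requests num_gpus, each rollout worker requests num_gpus_per_worker and num_cpus_per_worker, and num_gpus_per_learner_worker only counts when learner workers are actually created (assumed to be 0 here). The ResourceRequest class is hypothetical and not an RLlib API.

```python
from dataclasses import dataclass


@dataclass
class ResourceRequest:
    """Hypothetical tally of the resources one config requests."""

    num_gpus: float                 # GPUs requested for the driver / local worker
    num_gpus_per_worker: float      # GPUs requested per rollout worker
    num_cpus_per_worker: int        # CPUs requested per rollout worker
    num_rollout_workers: int
    num_gpus_per_learner_worker: float = 0.0
    num_learner_workers: int = 0    # assumption: no separate learner workers

    def total_gpus(self) -> float:
        """Sum of GPU requests across driver, rollout and learner workers."""
        return (
            self.num_gpus
            + self.num_rollout_workers * self.num_gpus_per_worker
            + self.num_learner_workers * self.num_gpus_per_learner_worker
        )

    def total_worker_cpus(self) -> int:
        """Sum of CPU requests across all rollout workers."""
        return self.num_rollout_workers * self.num_cpus_per_worker


# Values taken from the two configs in this thread.
first = ResourceRequest(num_gpus=0.5, num_gpus_per_worker=1.0,
                        num_cpus_per_worker=8, num_rollout_workers=3,
                        num_gpus_per_learner_worker=1.0)
second = ResourceRequest(num_gpus=0.1, num_gpus_per_worker=1.0,
                         num_cpus_per_worker=16, num_rollout_workers=2,
                         num_gpus_per_learner_worker=0.8)

if __name__ == "__main__":
    for name, req in (("first", first), ("second", second)):
        print(name, "GPUs:", req.total_gpus(), "worker CPUs:", req.total_worker_cpus())
```

Under these assumptions the first config asks for 3.5 GPUs and 24 rollout-worker CPUs, and the second for 2.1 GPUs and 32 rollout-worker CPUs. Comparing these totals against what each node actually offers may hint at why the driver or a worker ends up on a different node than expected.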
