Training parallelisation in RLlib

I am training a custom environment on a local cluster with 3 machines. During training I observed that only the sample collection seems to be distributed: while samples are collected, my head node idles and the other two workers are fully used, and during the training step it is the other way around. From what I saw in the docs this seems to be normal behavior.

Is there a way to fully utilize all resources in both steps? I guess this probably depends on how the learning algorithm works, but shouldn't it be possible to train on all machines and combine the results? Thanks in advance for any clarification on this topic.

For reference, here are the relevant parts of my config:

config = {
    "env": my_env,
    "env_config": my_env_config,
    "multiagent": {
        # ... (multi-agent policy setup omitted)
    },
    "num_workers": 2,
    # DL framework to use.
    "framework": "torch",
    "num_cpus_for_driver": 16,
    "num_cpus_per_worker": 16,
    "num_gpus": 1,
    "num_gpus_per_worker": 1,
    "disable_env_checking": True,
    "train_batch_size": 4000,
}

Hi @Blubberblub ,

This is normal behaviour if you don't collect samples on the driver thread, and that thread is the only one RLlib runs on the head node. RolloutWorkers can be spawned throughout your cluster to collect samples, including on your head node.

Your Algorithm will always run in the main thread (or its own thread when scheduled by Ray Tune) and only participates in sampling if there are no workers for this job.
You can, however, create workers on the head node that maximize utilization of the head node (save for the single CPU used by the Algorithm).
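As a rough sketch of what that could look like: since RolloutWorkers are ordinary Ray actors, the scheduler places them on any node with free CPU resources, including the head node, as long as the requested CPUs fit. The worker count and CPU sizes below are hypothetical, not taken from this thread's config:

```python
# Hypothetical resource budget for a 3-node cluster with 16 CPUs per node
# (48 CPUs total). The numbers are illustrative, not a recommendation.
config = {
    "num_workers": 5,          # rollout workers spread across the cluster
    "num_cpus_per_worker": 8,  # lets Ray place 2 workers on each remote
                               # node and 1 on the head node
    "num_cpus_for_driver": 1,  # the Algorithm only orchestrates; 1 core is enough
}

# CPUs requested in total: 5 workers * 8 CPUs + 1 driver CPU = 41,
# which fits into the 48-core cluster with room to spare.
total_cpus = (
    config["num_workers"] * config["num_cpus_per_worker"]
    + config["num_cpus_for_driver"]
)
print(total_cpus)
```

With a layout like this, the head node's idle cores do sampling work instead of waiting for the remote nodes.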


@arturn Thanks for the clarification! If I understand correctly, I can sample on the head node as well but can't train on the head node while the other nodes are sampling. It seems like a waste of time for sampling and training to run in turns. I remember an NVIDIA talk from GTC last year where they talked about "offsetting" training and sample generation, since their simulation (env) took around 1 hour to compute. But maybe we just need to upgrade our head node… :grinning:

You can sample on the head node; it's only the driver thread that can't sample in that case.
So it will only be the one CPU core that is "in charge" of orchestrating the Algorithm/Experiment you are running.
So if your head node has 16 cores, 15 can be used for sampling.
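In numbers, assuming the 16-core head node from the example above:

```python
# Head-node CPU budget, assuming a 16-core head node as discussed above.
head_node_cpus = 16
driver_cpus = 1  # one core runs the Algorithm/experiment orchestration
sampling_cpus = head_node_cpus - driver_cpus  # cores left for rollout workers
print(sampling_cpus)
```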
I am not familiar with the NVIDIA talk, but if you feel like we are missing out on resource efficiency, concrete suggestions are always welcome :slight_smile: