Training parallelisation in RLlib

I am training a custom environment on a local cluster with 3 machines. During training I observed that only the sample collection seems to be distributed: during sample collection my head node idles while the other two workers are fully used, and during the training step it is the other way around. From what I saw in the docs, this seems to be normal behavior.

Is there a way to fully utilize all resources in both steps? I guess this probably depends on how the learning algorithm works, but shouldn't it be possible to train on all machines and combine the results? Thanks in advance for any clarification on this topic.

For reference, here is part of my config:

config = {
    "env": my_env,
    "env_config": my_env_config,
    "multiagent": {
        ...
    },
    # Number of RolloutWorkers used for sample collection.
    "num_workers": 2,
    # DL framework to use.
    "framework": "torch",
    # Resources reserved for the driver (head node) and for each worker.
    "num_cpus_for_driver": 16,
    "num_cpus_per_worker": 16,
    "num_gpus": 1,
    "num_gpus_per_worker": 1,
    "disable_env_checking": True,
    "train_batch_size": 4000,
}

Hi @Blubberblub ,

This is normal behaviour if you don’t collect samples on the driver thread, and on your head node the driver is the only RLlib process running. RolloutWorkers can be spawned throughout your cluster to collect samples, including on your head node.

Your Algorithm will always run in the main thread (or its own thread when scheduled by Ray Tune) and only participates in sampling if there are no workers for the job.
You can, however, create workers on the head node to maximize utilization of the head node (save for the single CPU used by the Algorithm).
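
For illustration, here is a minimal sketch of how the resource keys could look so that a RolloutWorker also lands on the head node. This assumes three 16-core machines as in your setup; I have left out the multiagent block and the GPU keys for brevity, and the exact worker sizing is just one possible choice:

config = {
    "env": my_env,
    "env_config": my_env_config,
    "framework": "torch",
    # Reserve only one CPU on the head node for the driver/Algorithm ...
    "num_cpus_for_driver": 1,
    # ... and size the workers so that one fits on each machine,
    # including the head node with its 15 remaining cores.
    "num_workers": 3,
    "num_cpus_per_worker": 15,
    "disable_env_checking": True,
    "train_batch_size": 4000,
}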

@arturn Thanks for the clarification! If I understand correctly, I can sample on the head node as well, but I can’t train on the head node while the other nodes are sampling. It seems like a waste of time for the sampling and training processes to run in turns. I remember an NVIDIA talk from GTC last year where they talked about “offsetting” training and sample generation, since their simulation (env) took around 1 hour to compute. But maybe we just need to upgrade our head node… :grinning:

You can sample on the head node; it’s only the driver thread that can’t sample in that case.
So only the one CPU core that is “in charge” of orchestrating the Algorithm/Experiment you are running is excluded.
So if your head node has 16 cores, 15 can be used for sampling.
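
To make the arithmetic concrete, here is a quick sketch of the CPU budget, assuming three 16-core machines (the numbers are illustrative):

# CPU budget across the cluster (illustrative numbers).
cores_per_node = 16
num_nodes = 3
driver_cpus = 1  # one core on the head node reserved for the Algorithm/driver

head_node_sampling_cores = cores_per_node - driver_cpus            # 15
cluster_sampling_cores = cores_per_node * num_nodes - driver_cpus  # 47
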
I am not familiar with the NVIDIA talk, but if you feel like we are missing out on resource efficiency, concrete suggestions are always welcome :slight_smile: