Train with RLlib using multiple CPUs with Slurm


I have been trying to use a cluster to train some of my RL environments a bit faster than on my 8-core computer.

I am using Slurm to request the resources I need for training, i.e.

srun --partition=short --export=ALL --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=100G --time=00:30:00 --pty /bin/bash
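A quick way to check for a mismatch between the request and the node is to compare Slurm's grant with what the whole node exposes; `SLURM_CPUS_PER_TASK` is set by Slurm inside the job, while `nproc` counts every core visible on the machine:

```shell
# Inside the allocated session, compare what Slurm actually granted
# with what the whole node exposes (nproc counts every visible core).
granted="${SLURM_CPUS_PER_TASK:-unset}"
visible="$(nproc)"
echo "Slurm granted: ${granted}, node exposes: ${visible}"
```

If the two numbers differ, anything based on `multiprocessing.cpu_count()` will see the larger one.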

This usually assigns me 48 cores; therefore, I would expect a big increase in training speed compared to my 8-core computer. However, the opposite happens: training becomes extremely slow.

These are the configurations I am using for Ray:

config['framework'] = 'tf2'
config['eager_tracing'] = True
config['num_workers'] = num_processes  # from multiprocessing.cpu_count()
config['num_cpus_per_worker'] = 0
config['seed'] = seed
config['log_level'] = 'ERROR'


ray.init(num_cpus=multiprocessing.cpu_count(), ignore_reinit_error=True, log_to_driver=False)
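One thing to be aware of: `multiprocessing.cpu_count()` reports every core on the node, not the cores Slurm actually granted, so on a shared node Ray can be told to use far more CPUs than the allocation allows. A sketch of a Slurm-aware count, assuming the standard `SLURM_CPUS_PER_TASK` environment variable:

```python
import multiprocessing
import os

# SLURM_CPUS_PER_TASK is set by Slurm inside the job and reflects the
# --cpus-per-task grant; multiprocessing.cpu_count() reports every core
# on the node, which can oversubscribe the allocation. Fall back to
# cpu_count() when running outside Slurm.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", multiprocessing.cpu_count()))

# Hand the Slurm-aware count to Ray instead of the raw cpu_count():
# ray.init(num_cpus=num_cpus, ignore_reinit_error=True, log_to_driver=False)
```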

For reference, I am using a PPO trainer.

num_cpus_per_worker should be 1; otherwise, all workers will share the driver's CPU and make things extremely slow.
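Putting that fix together with the poster's settings, a minimal sketch might look like this; the choice of `num_workers` (allocated CPUs minus one, reserving a CPU for the driver) is an assumption for illustration, as are the placeholder `seed` value and the plain dict standing in for the RLlib config object:

```python
import multiprocessing

# Hypothetical values for illustration.
seed = 0
# Reserve one CPU for the driver; the rest go to rollout workers.
num_processes = max(multiprocessing.cpu_count() - 1, 1)

config = {}
config['framework'] = 'tf2'
config['eager_tracing'] = True
config['num_workers'] = num_processes
config['num_cpus_per_worker'] = 1  # each rollout worker gets its own CPU, not 0
config['seed'] = seed
config['log_level'] = 'ERROR'
```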