Train with RLlib using multiple CPUs with Slurm


I have been trying to use a cluster to train some of my RL environments a bit faster than on my 8-core computer.

I am using Slurm to request the resources I need for training, i.e.

srun --partition=short --export=ALL --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=100G --time=00:30:00 --pty /bin/bash
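A quick way to check for a mismatch between the request and the node is to compare Slurm's grant with what the whole node exposes; `SLURM_CPUS_PER_TASK` is set by Slurm inside the job, while `nproc` counts every core visible on the machine:

```shell
# Inside the allocated session, compare what Slurm actually granted
# with what the whole node exposes (nproc counts every visible core).
granted="${SLURM_CPUS_PER_TASK:-unset}"
visible="$(nproc)"
echo "Slurm granted: ${granted}, node exposes: ${visible}"
```

If the two numbers differ, anything based on `multiprocessing.cpu_count()` will see the larger one.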

This usually assigns me 48 cores; therefore, I would expect a big increase in training speed compared to my 8-core computer. However, the opposite happens: training becomes extremely slow.

These are the configurations I am using for Ray:

config['framework'] = 'tf2'
config['eager_tracing'] = True
config['num_workers'] = num_processes  # from multiprocessing.cpu_count()
config['num_cpus_per_worker'] = 0
config['seed'] = seed
config['log_level'] = 'ERROR'


ray.init(num_cpus=multiprocessing.cpu_count(), ignore_reinit_error=True, log_to_driver=False)
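One thing to be aware of: `multiprocessing.cpu_count()` reports every core on the node, not the cores Slurm actually granted, so on a shared node Ray can be told to use far more CPUs than the allocation allows. A sketch of a Slurm-aware count, assuming the standard `SLURM_CPUS_PER_TASK` environment variable:

```python
import multiprocessing
import os

# SLURM_CPUS_PER_TASK is set by Slurm inside the job and reflects the
# --cpus-per-task grant; multiprocessing.cpu_count() reports every core
# on the node, which can oversubscribe the allocation. Fall back to
# cpu_count() when running outside Slurm.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", multiprocessing.cpu_count()))

# Hand the Slurm-aware count to Ray instead of the raw cpu_count():
# ray.init(num_cpus=num_cpus, ignore_reinit_error=True, log_to_driver=False)
```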

For reference, I am using a PPO trainer.

num_cpus_per_worker should be 1; otherwise, all workers will share the driver's CPU and make things extremely slow.
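Putting that fix together with the poster's settings, a minimal sketch might look like this; the choice of `num_workers` (allocated CPUs minus one, reserving a CPU for the driver) is an assumption for illustration, as are the placeholder `seed` value and the plain dict standing in for the RLlib config object:

```python
import multiprocessing

# Hypothetical values for illustration.
seed = 0
# Reserve one CPU for the driver; the rest go to rollout workers.
num_processes = max(multiprocessing.cpu_count() - 1, 1)

config = {}
config['framework'] = 'tf2'
config['eager_tracing'] = True
config['num_workers'] = num_processes
config['num_cpus_per_worker'] = 1  # each rollout worker gets its own CPU, not 0
config['seed'] = seed
config['log_level'] = 'ERROR'
```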