Hello,
I have been trying to use a cluster to train some of my RL environments faster than on my 8-core computer.
I am using Slurm to request the resources I need for the training, i.e.
srun --partition=short --export=ALL --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=100G --time=00:30:00 --pty /bin/bash
This usually assigns me 48 cores, so I would expect a big increase in training speed compared to my 8-core computer. However, the opposite happens: training becomes extremely slow.
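For context, this is roughly how the core count can be compared against what Slurm actually reserved from inside the allocation (a quick sketch, assuming the standard SLURM_* environment variables are set for the job):

import multiprocessing
import os

# What Python sees on the allocated node (this is where num_processes comes from):
print("multiprocessing.cpu_count():", multiprocessing.cpu_count())

# What Slurm reports for this job (assumes these variables are exported by Slurm):
print("SLURM_CPUS_PER_TASK:", os.environ.get("SLURM_CPUS_PER_TASK"))
print("SLURM_CPUS_ON_NODE:", os.environ.get("SLURM_CPUS_ON_NODE"))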
These are the configuration settings I am using for Ray:
config["framework"] = "tf2"
config["eager_tracing"] = True
config["num_workers"] = num_processes  # from multiprocessing.cpu_count()
config["num_cpus_per_worker"] = 0
config["seed"] = seed
config["log_level"] = "ERROR"
with
ray.init(num_cpus=multiprocessing.cpu_count(), ignore_reinit_error=True, log_to_driver=False)
For reference, I am using a PPO trainer.
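Putting it all together, my training script looks roughly like this (a minimal sketch using the older ray.rllib.agents.ppo API; the environment name and the training loop are just placeholders standing in for my custom environment):

import multiprocessing
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

num_processes = multiprocessing.cpu_count()
seed = 0  # placeholder seed

config = DEFAULT_CONFIG.copy()
config["framework"] = "tf2"
config["eager_tracing"] = True
config["num_workers"] = num_processes  # from multiprocessing.cpu_count()
config["num_cpus_per_worker"] = 0
config["seed"] = seed
config["log_level"] = "ERROR"

ray.init(num_cpus=multiprocessing.cpu_count(), ignore_reinit_error=True, log_to_driver=False)

# "CartPole-v1" is only a placeholder; in my case this is a custom RL environment.
trainer = PPOTrainer(config=config, env="CartPole-v1")

for _ in range(100):
    result = trainer.train()
    print(result["episode_reward_mean"])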