[Slurm] Proper way to launch the same script on n independent nodes

Hi!

I am using Ray to distribute my deep RL APEX implementation. However, I don’t want to use Ray’s ability to spread the load across several machines; I only want to use it on a single machine at a time. The cluster I’m using runs Slurm. At the moment, I simply launch my script n times (with ray.init() and everything inside) as independent jobs. However, when running more than 3 copies of the same script at the same time, I notice a very strange change in training behaviour. It might be completely unrelated, but I have been unable to track it down to anything else. Running the full training 15 times sequentially, one script at a time, on my local machine or on the cluster works just fine.

My question is: could one script/session disturb the others, or am I imagining things? If it can, how should I launch my scripts correctly? If possible, I would like the n scripts to be able to run independently (without having to manage a head node).

Let’s say that this is my sbatch script.

#!/bin/bash
#
# Slurm arguments
#
#SBATCH --job-name=test1 
#SBATCH --export=ALL                
#SBATCH --output=logs/19-03-2021--01-27-27___1_stdout.log   
#SBATCH --cpus-per-task=20          
#SBATCH --mem-per-cpu=1000M            
#SBATCH --gres=gpu:0              
#SBATCH --time=48:00:00             
#

conda activate deep_learning

cd /home/bdebes/mt  # CHANGEME
python experiments.py -c TD3_double_inverted_pendulum_modifiable.json -l logs/19-03-2021--01-27-27___1.log

and experiments.py basically does:

import ray
ray.init() # that's the actual call I'm using at the moment; no address, so each job starts its own local instance
train() # the training entry point defined in experiments.py
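
If it helps with debugging, I could add something like this right after the init to dump what each session sees (just an idea based on Ray’s public API; I have not actually wired it in yet):

import ray

ray.init()  # same bare call as above

# Sanity check: each job should report only its own local Ray instance,
# plus whatever resources that instance detected on the node.
print("nodes:", ray.nodes())
print("cluster resources:", ray.cluster_resources())

train()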

Thank you in advance!

Baptiste

Hmmm, I think there shouldn’t be any disturbance. Is it possible that they are running on the same node?

Maybe you’ll have to use ray.init(num_cpus=20) to reduce the parallelism.
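
Something like the sketch below is what I had in mind. The SLURM_CPUS_PER_TASK lookup, the include_dashboard flag, and the per-job _temp_dir are only suggestions (and the /tmp/ray_<job id> path is just an example): the idea is to keep each job inside its own Slurm allocation and its own session directory, so concurrent jobs on the same node can’t step on each other:

import os

import ray

# Limit Ray to the CPUs Slurm actually allocated to this job instead of
# letting it auto-detect every CPU on the node (20 with --cpus-per-task=20).
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

# Give each job its own session/temp directory so concurrent jobs on the
# same node never share state under the default /tmp/ray location.
job_id = os.environ.get("SLURM_JOB_ID", "local")

ray.init(
    num_cpus=num_cpus,
    _temp_dir=f"/tmp/ray_{job_id}",
    include_dashboard=False,  # no dashboard, so jobs don't compete for its port
)

train()  # your existing training entry point from experiments.py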