I am using Ray to distribute my deep RL Ape-X implementation. However, I don't want to use Ray's ability to spread the load across several machines; I want to use it on a single machine at a time. The cluster I'm using runs Slurm. At the moment, I simply run my script n times independently (with ray.init() and everything inside). However, when running the same script more than 3 times simultaneously, I notice very strange changes in training behaviour. This might be completely unrelated, but I have been unable to track it down to anything else. Running the full training 15 times sequentially, one script at a time, on my local machine or on the cluster works just fine.
My question is: could one script/session disturb the others, or is this just fantasy? If so, how can I run my scripts correctly? If possible, I would like the n scripts to run independently (without setting up a head node).
Let's say that this is my sbatch script:
```
#
# Slurm arguments
#
#SBATCH --job-name=test1
#SBATCH --export=ALL
#SBATCH --output=logs/19-03-2021--01-27-27___1_stdout.log
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=1000M
#SBATCH --gres=gpu:0
#SBATCH --time=48:00:00
#
conda activate deep_learning
cd /home/bdebes/mt   # CHANGEME
python experiments.py -c TD3_double_inverted_pendulum_modifiable.json -l logs/19-03-2021--01-27-27___1.log
```
and experiments.py basically does:
```python
import ray

ray.init()  # that's the actual call I'm using at the moment
train()
```
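For reference, one thing I could try is giving each Slurm job its own Ray session directory, since by default every ray.init() writes session state under the same shared temp location on the node. This is just a sketch of the idea (untested; the `_temp_dir` and `include_dashboard` parameters are my understanding of the Ray API, and the path scheme is made up):

```python
import os

def per_job_ray_tmp(base="/tmp"):
    # Derive a unique Ray temp dir per Slurm job so that several
    # concurrent sessions on one node do not share the same
    # default session directory. "local" is a made-up fallback
    # for runs outside Slurm.
    job_id = os.environ.get("SLURM_JOB_ID", "local")
    return os.path.join(base, f"ray_{job_id}")

# Then initialize Ray with an explicit CPU budget and the
# per-job directory, e.g.:
# ray.init(num_cpus=20, include_dashboard=False,
#          _temp_dir=per_job_ray_tmp())
```

I don't know whether this is the right way to isolate sessions, or whether Ray already namespaces them well enough on its own.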
Thank you in advance!