Hi!
I am using Ray to distribute my deep RL APEX implementation. However, I don't want to use Ray's ability to spread the load across several machines; I want to use it on a single machine at a time. The cluster I'm using runs Slurm. At the moment, I simply launch my script (with ray.init() and everything inside it) n times independently. However, when running more than 3 copies of the same script at the same time, I notice very strange changes in training behaviour. It might be completely unrelated, but I have been unable to track it down to anything else. Running the full training 15 times sequentially, one script at a time, on my local machine or on the cluster works just fine.
My question is: could one script/session disturb the others, or is this just my imagination? If yes, how can I run my scripts correctly? If possible, I would like the n scripts to run independently (without having to set up a head node).
Let’s say that this is my sbatch script.
#!/bin/bash
#
# Slurm arguments
#
#SBATCH --job-name=test1
#SBATCH --export=ALL
#SBATCH --output=logs/19-03-2021--01-27-27___1_stdout.log
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=1000M
#SBATCH --gres=gpu:0
#SBATCH --time=48:00:00
#
conda activate deep_learning
cd /home/bdebes/mt # CHANGEME
python experiments.py -c TD3_double_inverted_pendulum_modifiable.json -l logs/19-03-2021--01-27-27___1.log
and experiments.py basically does:
import ray
ray.init() # that's the actual call I'm using at the moment
train()  # runs the actual training loop
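In case it helps, this is the kind of more constrained init I was thinking of trying. It is only a sketch, and the arguments are my guesses at how to keep the sessions separate (limit Ray to the CPUs Slurm gives the job, skip the dashboard, and use a per-job temp directory); I haven't verified that any of this is actually needed:

import os
import ray

# Assumption: cap Ray at the CPUs Slurm allocated to this job instead of
# letting it autodetect every core on the node, and keep each job's session
# files in its own directory so the runs share nothing on disk.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
job_id = os.environ.get("SLURM_JOB_ID", "local")

ray.init(
    num_cpus=num_cpus,               # only the CPUs reserved by --cpus-per-task
    include_dashboard=False,         # avoid any dashboard port clash between jobs
    _temp_dir=f"/tmp/ray_{job_id}",  # per-job session/temp directory
)

train()  # same training entry point as above

Would something like this guarantee that the n sessions cannot interfere, or is there a better way to do it?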
Thank you in advance!
Baptiste