OK, solved by adopting the right Redis password strategy. My bad: I had not added it to the Python script before. The relevant sections of the sbatch script (e.g. for a worker node) are:
#SBATCH --cpus-per-task=40
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
and
redis_password=$(uuidgen)
export redis_password
this_node_ip=$(srun --nodes=1 --ntasks=1 -w "$node_i" hostname --ip-address)
srun --nodes=1 --ntasks=1 -w "$node_i" \
    ray start --address "$ip_head" \
    --redis-password="$redis_password" \
    --node-ip-address="$this_node_ip" \
    --num-cpus "${SLURM_CPUS_PER_TASK}" --block &
sleep 10
(see slurm-basic.sh). And in Python:
import os
import ray
ray.init(address="auto", _redis_password=os.environ["redis_password"])
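In case it helps anyone hitting the same thing: what bit me was the Python side not picking up the password that the sbatch script exports. Below is a rough sketch of a guard around that lookup, so a missing variable fails with a readable message instead of a bare KeyError. The helper names `ray_init_kwargs` and `connect` are made up for illustration and are not part of Ray; only `ray.init(address=..., _redis_password=...)` comes from the snippet above.

```python
import os


def ray_init_kwargs(env=None, var="redis_password"):
    """Build the keyword arguments for ray.init() from the environment.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    `var` is the name exported by the sbatch script (redis_password above).
    """
    env = os.environ if env is None else env
    password = env.get(var)
    if not password:
        # Typical cause: the sbatch script set the variable but never exported it.
        raise RuntimeError(
            f"Environment variable {var!r} is not set; "
            "export it in the sbatch script before starting Python."
        )
    return {"address": "auto", "_redis_password": password}


def connect(env=None):
    """Hypothetical wrapper: connect to the already running Ray cluster."""
    import ray  # imported here so the helper above works without Ray installed

    return ray.init(**ray_init_kwargs(env))
```

In the driver script, calling `connect()` would then take the place of the direct `ray.init(...)` line, assuming the script is launched from the same sbatch job so that it inherits the exported variable.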
See also: Ray on SLURM, unmatched Raylet address - #3 by hank7v