Hi @peytondmurray,
I'm on Ray 2.1, and I use the same call both manually (via SSH) and in the sbatch script, apart from the `--block` option.
About the IP: all the nodes have three interfaces. In the batch script, for a worker node, I run

```bash
srun --nodes=1 --ntasks=1 -w "$node_i" ifconfig
```
and here is the output for that worker
```
srun --nodes=1 --ntasks=1 -w ccs0133 ifconfig
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 172.2.2.133  netmask 255.255.0.0  broadcast 172.2.255.255
        inet6 fe80::ee0d:9a03:89:176a  prefixlen 64  scopeid 0x20<link>
        infiniband 00:00:10:91:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 117123421  bytes 54477566218 (50.7 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 166538971  bytes 123640462406 (115.1 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 17229287  bytes 37506789512 (34.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 17229287  bytes 37506789512 (34.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

p3p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.212.2.133  netmask 255.255.248.0  broadcast 10.212.7.255
        inet6 fe80::3efd:feff:fe56:70e0  prefixlen 64  scopeid 0x20<link>
        ether 3c:fd:fe:56:70:e0  txqueuelen 1000  (Ethernet)
        RX packets 29004104231  bytes 34277022788689 (31.1 TiB)
        RX errors 0  dropped 26928  overruns 0  frame 0
        TX packets 29000446008  bytes 34221473313846 (31.1 TiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```
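(Incidentally, to dump the IPv4 addresses of all interfaces on every allocated node in one go, something like the following sketch works; the `scontrol show hostnames` pattern is the usual one from the Ray-on-SLURM template:)

```bash
# Sketch: print every IPv4 address on every node of the allocation,
# assuming the usual $SLURM_JOB_NODELIST expansion.
for node_i in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    echo "=== $node_i ==="
    srun --nodes=1 --ntasks=1 -w "$node_i" ip -o -4 addr show
done
```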
I tried all the interfaces. Below is the version, for the worker node, where I use the ib0 interface; initially I was using the default address returned by `hostname --ip-address`, which was the p3p1 one.
```bash
redis_password=$(uuidgen)
export redis_password

this_node_ip=$(srun --nodes=1 --ntasks=1 -w "$node_i" ip -o -4 addr list ib0 | awk '{print $4}' | cut -d/ -f1)
srun --nodes=1 --ntasks=1 -w "$node_i" \
    ray start --address "$ip_head" \
    --redis-password="$redis_password" \
    --node-ip-address="$this_node_ip" \
    --num-cpus "${SLURM_CPUS_PER_TASK}" --block &
sleep 30
```
and this is the output I get:
```
srun --nodes=1 --ntasks=1 -w ccs0133 ray start --address 172.2.2.117:6379 --redis-password=dbdf4009-67f8-482a-8ccd-6e9fbb81a4a4 --node-ip-address=172.2.2.133 --num-cpus 40 --block
[2022-12-01 18:44:59,095 I 265923 265923] global_state_accessor.cc:357: This node has an IP address of 172.2.2.133, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
2022-12-01 18:44:59,044  INFO scripts.py:883 -- Local node IP: 172.2.2.133
2022-12-01 18:44:59,097  SUCC scripts.py:895 -- --------------------
2022-12-01 18:44:59,097  SUCC scripts.py:896 -- Ray runtime started.
2022-12-01 18:44:59,097  SUCC scripts.py:897 -- --------------------
2022-12-01 18:44:59,097  INFO scripts.py:899 -- To terminate the Ray runtime, run
2022-12-01 18:44:59,097  INFO scripts.py:900 --   ray stop
2022-12-01 18:44:59,097  INFO scripts.py:905 --   --block
2022-12-01 18:44:59,097  INFO scripts.py:906 -- This command will now block forever until terminated by a signal.
2022-12-01 18:44:59,097  INFO scripts.py:909 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
```
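Since the warning mentions connecting "with a different IP address", for reference here is a sketch of the matching head-node start pinned to the same ib0 interface, in case the issue is the head and the workers registering on different interfaces (`$head_node` is a placeholder for the head node's hostname, not a variable from my script):

```bash
# Sketch: start the head on its ib0 address so head and workers
# agree on the interface ($head_node is a placeholder).
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" ip -o -4 addr list ib0 | awk '{print $4}' | cut -d/ -f1)
ip_head="$head_node_ip:6379"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=6379 \
    --redis-password="$redis_password" \
    --num-cpus "${SLURM_CPUS_PER_TASK}" --block &
sleep 30
```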
UPDATE: previously, I hadn't set up Redis correctly in the Python script; I was using

```python
ray.init()
```

It worked when run via SSH, but when using the batch script the number of available CPUs was always 40. After replacing it with the correct call,

```python
ray.init(address="auto", _redis_password=os.environ["redis_password"])
```

both nodes (2 x 40 CPUs) are recognised when running the batch as well. Nevertheless, the ERROR persists in the batch case. When handling everything manually via SSH, the `ray start` command on the worker either returns no error at all, or returns an error the first time but works fine when the call is repeated. Maybe I should not worry about the error message, since everything seems to work now (?)
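As a sanity check, here is a minimal sketch (using the public `ray.cluster_resources()` and `ray.nodes()` APIs) to confirm that both nodes joined and to see which IP each raylet actually registered under:

```python
import os
import ray

# Connect to the running cluster started by the sbatch script.
ray.init(address="auto", _redis_password=os.environ["redis_password"])

# With two 40-CPU nodes this should report ~80 CPUs.
print(ray.cluster_resources())

# NodeManagerAddress is the IP each raylet registered with; if it differs
# from the ib0 address passed via --node-ip-address, that would explain
# the "can not find the matched Raylet address" warning.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"])
```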
Thanks a lot for digging into it,
Best
Fabrizio