Hi
I deployed Ray 1.9.2 on two CentOS 7 servers using a YAML cluster config and found the following issues.
Issues
1. I start Ray with my own Python 3 binary, /usr/local/python3/bin/python3, but when I submit a script, Ray runs it with python, which resolves to the default /usr/bin/python.
2. If the worker node IP in my config is not the one bound to eth0, the worker node restarts continuously.
3. “ray stop” takes no parameters, so it tries to stop the default port 6379, but I set the head port to 26379.
[ray@ml-test ~]$ ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:16:3e:0c:35:2f brd ff:ff:ff:ff:ff:ff
inet 172.16.210.22/24 brd 172.16.210.255 scope global dynamic eth0
valid_lft 309320647sec preferred_lft 309320647sec
3: ztbtovnmsg: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc pfifo_fast state UNKNOWN group default qlen 1000
link/ether 36:bb:ef:b5:f2:af brd ff:ff:ff:ff:ff:ff
inet 172.28.200.59/16 brd 172.28.255.255 scope global ztbtovnmsg
valid_lft forever preferred_lft forever
[ray@al-bj-ml-prd ~]$ ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:16:3e:0a:2d:ae brd ff:ff:ff:ff:ff:ff
inet 172.16.210.21/24 brd 172.16.210.255 scope global dynamic eth0
valid_lft 268112162sec preferred_lft 268112162sec
244: ztbtovnmsg: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc pfifo_fast state UNKNOWN group default qlen 1000
link/ether 36:fb:bf:69:09:f7 brd ff:ff:ff:ff:ff:ff
inet 172.28.1.12/16 brd 172.28.255.255 scope global ztbtovnmsg
valid_lft forever preferred_lft forever
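To make the interface question concrete, here is a minimal Python sketch that maps each interface in the "ip a" output above to its IPv4 address. The helper name ipv4_by_interface and the SAMPLE_IP_A constant are hypothetical and not part of Ray; the sample text is copied from the ml-test output. The resulting address could then be passed to ray start through its --node-ip-address option, although whether that stops the restart loop described here is an assumption, not something confirmed in this thread.

```python
import re

# Matches lines such as:
#   inet 172.16.210.22/24 brd 172.16.210.255 scope global dynamic eth0
# and captures the IPv4 address plus the trailing interface label.
INET_LINE = re.compile(r"^\s*inet\s+(\d+\.\d+\.\d+\.\d+)/\d+\b.*\s(\S+)\s*$")

# Sample input copied from the ml-test host above (illustration only).
SAMPLE_IP_A = """\
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:16:3e:0c:35:2f brd ff:ff:ff:ff:ff:ff
inet 172.16.210.22/24 brd 172.16.210.255 scope global dynamic eth0
valid_lft 309320647sec preferred_lft 309320647sec
3: ztbtovnmsg: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc pfifo_fast state UNKNOWN group default qlen 1000
link/ether 36:bb:ef:b5:f2:af brd ff:ff:ff:ff:ff:ff
inet 172.28.200.59/16 brd 172.28.255.255 scope global ztbtovnmsg
valid_lft forever preferred_lft forever
"""


def ipv4_by_interface(ip_a_output):
    """Return {interface: first IPv4 address} parsed from `ip a` text."""
    addresses = {}
    for line in ip_a_output.splitlines():
        match = INET_LINE.match(line)
        if match:
            address, interface = match.groups()
            # Keep only the first address reported for each interface.
            addresses.setdefault(interface, address)
    return addresses


if __name__ == "__main__":
    print(ipv4_by_interface(SAMPLE_IP_A))
    # → {'eth0': '172.16.210.22', 'ztbtovnmsg': '172.28.200.59'}
```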
[ray@ml-test ~]$ alias
alias python='python3'
[ray@ml-test ~]$ which python3
/usr/local/python3/bin/python3
Suggestions
For issue 1: adding alias python='python3' works around the problem, but it is not pretty.
Using sys.executable instead of the hard-coded python might be better:
[root@localhost ~]# python3
Python 3.8.12 (default, Sep 19 2021, 21:26:47)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.executable
'/usr/local/bin/python3'
For issue 2: could the worker node be started without hard-coding eth0?
For issue 3: adding a parameter or config option to the stop subcommand might be better.
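The sys.executable idea for issue 1 could look roughly like the sketch below. It is only an illustration of the suggestion, not Ray's actual implementation: build_run_command is a hypothetical helper, and it assumes the command is built on the same machine that will run the script, since sys.executable reports the interpreter of the process that evaluates it.

```python
import shlex
import sys


def build_run_command(script_path, interpreter=None):
    """Build a shell command that runs script_path with a chosen interpreter.

    Hypothetical helper (not part of Ray). When no interpreter is given,
    fall back to the one running this process instead of a bare "python".
    """
    interpreter = interpreter or sys.executable
    # Quote both parts so paths containing spaces survive the shell.
    return f"{shlex.quote(interpreter)} {shlex.quote(script_path)}"


if __name__ == "__main__":
    # Default: whichever interpreter is executing this file.
    print(build_run_command("script.py"))
    # Explicit override, using the interpreter path from this thread.
    print(build_run_command("script.py", "/usr/local/python3/bin/python3"))
```

The design point is simply that the interpreter is resolved from the running process or from explicit configuration, so a shell alias is no longer needed to pick the right Python.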
Yes, Ray stop kills all Ray processes for all Ray clusters running on a node. I think we’ve considered adding an address parameter, but deprioritized it due to technical points with the current implementation of Ray stop – it’s not trivial to do.
[ray@ml-test ~]$ ray submit example-full.yaml script.py -v
Loaded cached provider configuration from /tmp/ray-config-c0e056725bcabcdb71cb14638eb1afcf075f2def
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2022-01-28 15:00:43,277 INFO node_provider.py:43 -- ClusterState: Loaded cluster state: ['172.16.210.21', '172.16.210.22']
Fetched IP: 172.16.210.22
Running `rsync --rsh ssh -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/564ad1a02f/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore script.py ray@172.16.210.22:~/script.py`
Warning: Permanently added '172.16.210.22' (ECDSA) to the list of known hosts.
sending incremental file list
sent 58 bytes received 12 bytes 140.00 bytes/sec
total size is 627 speedup is 8.96
`rsync`ed script.py (local) to ~/script.py (remote)
Fetched IP: 172.16.210.22
Running `/usr/local/python3/bin/python3.7 ~/script.py`
This cluster consists of
2 nodes in total
8.0 CPU resources in total
(f pid=19102)
Tasks executed
9248 tasks on 172.16.210.22
752 tasks on 172.16.210.21
Shared connection to 172.16.210.22 closed.
See this line in the output above:
Running `/usr/local/python3/bin/python3.7 ~/script.py`