Some Issues When Starting My Ray Cluster on CentOS 7

Hi,
I deployed Ray 1.9.2 on two CentOS 7 servers using a cluster YAML and found the following issues:

  • Issues

    1. If I start Ray with my own Python 3 binary (/usr/local/python3/bin/python3), then when I submit a script, Ray runs it with `python`, which resolves to the default /usr/bin/python.
    2. If the worker node IP I set in the config is not bound to eth0, the worker node keeps restarting.
    3. When I stop Ray with `ray stop`, since it takes no parameters it tries to stop the default port 6379, but I set the head port to 26379.
  • My Config

cluster_name: default

provider:
    type: local
    head_ip: 172.28.200.59
    worker_ips: [172.28.1.12]

auth:
    ssh_user: ray
    ssh_private_key: ~/.ssh/id_rsa

min_workers: 1
max_workers: 1

upscaling_speed: 1.0
idle_timeout_minutes: 5

file_mounts: {
}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=26379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:26379
  • My Environment
    [ray@ml-test ~]$ ip a
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether 00:16:3e:0c:35:2f brd ff:ff:ff:ff:ff:ff
        inet 172.16.210.22/24 brd 172.16.210.255 scope global dynamic eth0
           valid_lft 309320647sec preferred_lft 309320647sec
    3: ztbtovnmsg: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc pfifo_fast state UNKNOWN group default qlen 1000
        link/ether 36:bb:ef:b5:f2:af brd ff:ff:ff:ff:ff:ff
        inet 172.28.200.59/16 brd 172.28.255.255 scope global ztbtovnmsg
           valid_lft forever preferred_lft forever
    
    [ray@al-bj-ml-prd ~]$ ip a
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        link/ether 00:16:3e:0a:2d:ae brd ff:ff:ff:ff:ff:ff
        inet 172.16.210.21/24 brd 172.16.210.255 scope global dynamic eth0
           valid_lft 268112162sec preferred_lft 268112162sec
    244: ztbtovnmsg: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc pfifo_fast state UNKNOWN group default qlen 1000
        link/ether 36:fb:bf:69:09:f7 brd ff:ff:ff:ff:ff:ff
        inet 172.28.1.12/16 brd 172.28.255.255 scope global ztbtovnmsg
           valid_lft forever preferred_lft forever
    
    [ray@ml-test ~]$ alias
    alias python='python3'
    [ray@ml-test ~]$ which python3
    /usr/local/python3/bin/python3
    
    
    
  • Suggestions
    1. Aliasing python to python3 (alias python='python3') works around issue 1, but it is not pretty.

      1. Using sys.executable instead of the hard-coded python might be better; see the sketch after this list.

        [root@localhost ~]# python3
        Python 3.8.12 (default, Sep 19 2021, 21:26:47)
        [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import sys
        >>> sys.executable
        '/usr/local/bin/python3'
        
    2. Could the worker node be started without hard-coding eth0?

    3. Adding a parameter or config option to the stop subcommand might be better.
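
To illustrate suggestion 1, here is a minimal sketch of the idea (the function name build_submit_command and the surrounding structure are hypothetical, not Ray's actual code; the real one-line change is shown in the diff further down):

import sys


def build_submit_command(target: str) -> list:
    """Build the command used to run a submitted script on the cluster node.

    Hypothetical illustration: instead of the hard-coded "python", use
    sys.executable so the same interpreter that runs Ray is reused.
    """
    # Before: command_parts = ["python", target]
    command_parts = [sys.executable, target]
    return command_parts


if __name__ == "__main__":
    # Prints e.g. ['/usr/local/python3/bin/python3', '~/script.py']
    print(build_submit_command("~/script.py"))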

@raulchen, do you have any suggestions? cc @Dmitri too

  1. There might be some maneuvering you could do with the setup_commands to get the correct behavior, along the lines of ray/defaults.yaml at 1fee0159b4573138284e6902ef484a43350eeda9 · ray-project/ray · GitHub
  2. I don’t know anything about this one.
  3. Yes, Ray stop kills all Ray processes for all Ray clusters running on a node. I think we’ve considered adding an address parameter, but deprioritized it due to technical points with the current implementation of Ray stop – it’s not trivial to do.
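
As a rough illustration of what an address-scoped stop could look like in the meantime, here is a workaround sketch (not Ray's implementation; it assumes psutil is installed and that the port string appears on the process command line, which is not guaranteed for every Ray process):

import psutil


def kill_ray_processes_for_port(port: int) -> None:
    """Terminate processes whose command line mentions Ray and the given port.

    Workaround sketch only: a heuristic filter on the command line, not the
    logic that `ray stop` actually uses.
    """
    needle = f":{port}"
    for proc in psutil.process_iter(["pid", "name", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ray" in cmdline and needle in cmdline:
            try:
                proc.terminate()
            except psutil.NoSuchProcess:
                pass


if __name__ == "__main__":
    # Only touch processes tied to the non-default head port 26379.
    kill_ray_processes_for_port(26379)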

About 1:
Modifying the following line solves the problem nicely:

[root@ml-test work]# diff /usr/local/python3/lib/python3.7/site-packages/ray/scripts/scripts.py.orig /usr/local/python3/lib/python3.7/site-packages/ray/scripts/scripts.py
1276c1276,1277
<     command_parts = ["python", target]
---
>     # command_parts = ["python", target]
>     command_parts = [sys.executable, target]

When I submit, I get:

[ray@ml-test ~]$ ray submit example-full.yaml script.py -v
Loaded cached provider configuration from /tmp/ray-config-c0e056725bcabcdb71cb14638eb1afcf075f2def
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2022-01-28 15:00:43,277 INFO node_provider.py:43 -- ClusterState: Loaded cluster state: ['172.16.210.21', '172.16.210.22']
Fetched IP: 172.16.210.22
Running `rsync --rsh ssh -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/564ad1a02f/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore script.py ray@172.16.210.22:~/script.py`
Warning: Permanently added '172.16.210.22' (ECDSA) to the list of known hosts.
sending incremental file list

sent 58 bytes  received 12 bytes  140.00 bytes/sec
total size is 627  speedup is 8.96
`rsync`ed script.py (local) to ~/script.py (remote)
Fetched IP: 172.16.210.22
Running `/usr/local/python3/bin/python3.7 ~/script.py`
This cluster consists of
    2 nodes in total
    8.0 CPU resources in total

(f pid=19102)
Tasks executed
    9248 tasks on 172.16.210.22
    752 tasks on 172.16.210.21
Shared connection to 172.16.210.22 closed.

Note the line:
Running /usr/local/python3/bin/python3.7 ~/script.py

About 2:
I have no idea either. Maybe it is because of the code that picks the network interface: the IP I set is bound to the interface “ztbtovnmsg”, which is not in the allowed list, but I don’t know the details of Ray’s internals.
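
To illustrate, a minimal sketch (assuming psutil is installed; this is not the code Ray uses) for checking which interface a configured IP is actually bound to:

from typing import Optional

import psutil


def find_interface_for_ip(ip: str) -> Optional[str]:
    """Return the name of the network interface that carries the given IP,
    or None if no interface has it bound."""
    for ifname, addrs in psutil.net_if_addrs().items():
        for addr in addrs:
            if addr.address == ip:
                return ifname
    return None


if __name__ == "__main__":
    # On the worker in question this returns "ztbtovnmsg" rather than "eth0".
    print(find_interface_for_ip("172.28.1.12"))

If the problem is indeed the interface lookup, passing the IP explicitly with ray start --node-ip-address=<worker ip> in worker_start_ray_commands might also be worth a try.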