Some Issues When Starting My Ray Cluster on CentOS 7

Hi,
I deployed Ray 1.9.2 on two CentOS 7 servers using a cluster YAML and found the following issues:

  • Issues

    1. If I start Ray with my own Python 3 binary (/usr/local/python3/bin/python3), then when I submit a script, Ray runs it with `python`, which resolves to the default /usr/bin/python.
    2. If the worker node IP I set in the config is not bound to eth0, the worker node keeps restarting.
    3. When I stop Ray with `ray stop`, since it takes no parameters it tries to stop the default port 6379, but I set the head port to 26379.
  • My Config

cluster_name: default

provider:
    type: local
    head_ip: 172.28.200.59
    worker_ips: [172.28.1.12]

auth:
    ssh_user: ray
    ssh_private_key: ~/.ssh/id_rsa

min_workers: 1
max_workers: 1

upscaling_speed: 1.0
idle_timeout_minutes: 5

file_mounts: {
}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=26379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:26379
  • My Environment
    [ray@ml-test ~]$ ip a
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether 00:16:3e:0c:35:2f brd ff:ff:ff:ff:ff:ff
        inet 172.16.210.22/24 brd 172.16.210.255 scope global dynamic eth0
           valid_lft 309320647sec preferred_lft 309320647sec
    3: ztbtovnmsg: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc pfifo_fast state UNKNOWN group default qlen 1000
        link/ether 36:bb:ef:b5:f2:af brd ff:ff:ff:ff:ff:ff
        inet 172.28.200.59/16 brd 172.28.255.255 scope global ztbtovnmsg
           valid_lft forever preferred_lft forever
    
    [ray@al-bj-ml-prd ~]$ ip a
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        link/ether 00:16:3e:0a:2d:ae brd ff:ff:ff:ff:ff:ff
        inet 172.16.210.21/24 brd 172.16.210.255 scope global dynamic eth0
           valid_lft 268112162sec preferred_lft 268112162sec
    244: ztbtovnmsg: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc pfifo_fast state UNKNOWN group default qlen 1000
        link/ether 36:fb:bf:69:09:f7 brd ff:ff:ff:ff:ff:ff
        inet 172.28.1.12/16 brd 172.28.255.255 scope global ztbtovnmsg
           valid_lft forever preferred_lft forever
    
    [ray@ml-test ~]$ alias
    alias python='python3'
    [ray@ml-test ~]$ which python3
    /usr/local/python3/bin/python3
    
    
    
  • Suggestions
    1. Aliasing python to python3 (alias python='python3') works around issue 1, but it is not pretty.

      1. Using sys.executable instead of the hard-coded python might be better; see the sketch after this list.

        [root@localhost ~]# python3
        Python 3.8.12 (default, Sep 19 2021, 21:26:47)
        [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import sys
        >>> sys.executable
        '/usr/local/bin/python3'
        
    2. Could the worker node be started without hard-coding eth0?

    3. Adding a parameter or config option to the stop subcommand might be better.
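
To illustrate suggestion 1, here is a minimal sketch of the idea (the function name build_submit_command and the surrounding structure are hypothetical, not Ray's actual code; the real one-line change is shown in the diff further down):

import sys


def build_submit_command(target: str) -> list:
    """Build the command used to run a submitted script on the cluster node.

    Hypothetical illustration: instead of the hard-coded "python", use
    sys.executable so the same interpreter that runs Ray is reused.
    """
    # Before: command_parts = ["python", target]
    command_parts = [sys.executable, target]
    return command_parts


if __name__ == "__main__":
    # Prints e.g. ['/usr/local/python3/bin/python3', '~/script.py']
    print(build_submit_command("~/script.py"))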

@raulchen, do you have any suggestions? cc @Dmitri too

  1. There might be some maneuvering you could do with the setup_commands to get the correct behavior, along the lines of ray/defaults.yaml at 1fee0159b4573138284e6902ef484a43350eeda9 · ray-project/ray · GitHub
  2. I don’t know anything about this one.
  3. Yes, Ray stop kills all Ray processes for all Ray clusters running on a node. I think we’ve considered adding an address parameter, but deprioritized it due to technical points with the current implementation of Ray stop – it’s not trivial to do.
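
As a rough illustration of what an address-scoped stop could look like in the meantime, here is a workaround sketch (not Ray's implementation; it assumes psutil is installed and that the port string appears on the process command line, which is not guaranteed for every Ray process):

import psutil


def kill_ray_processes_for_port(port: int) -> None:
    """Terminate processes whose command line mentions Ray and the given port.

    Workaround sketch only: a heuristic filter on the command line, not the
    logic that `ray stop` actually uses.
    """
    needle = f":{port}"
    for proc in psutil.process_iter(["pid", "name", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ray" in cmdline and needle in cmdline:
            try:
                proc.terminate()
            except psutil.NoSuchProcess:
                pass


if __name__ == "__main__":
    # Only touch processes tied to the non-default head port 26379.
    kill_ray_processes_for_port(26379)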

About 1:
Modifying the following line solves the problem nicely:

[root@ml-test work]# diff /usr/local/python3/lib/python3.7/site-packages/ray/scripts/scripts.py.orig /usr/local/python3/lib/python3.7/site-packages/ray/scripts/scripts.py
1276c1276,1277
<     command_parts = ["python", target]
---
>     # command_parts = ["python", target]
>     command_parts = [sys.executable, target]

When I submit, I get:

[ray@ml-test ~]$ ray submit example-full.yaml script.py -v
Loaded cached provider configuration from /tmp/ray-config-c0e056725bcabcdb71cb14638eb1afcf075f2def
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2022-01-28 15:00:43,277 INFO node_provider.py:43 -- ClusterState: Loaded cluster state: ['172.16.210.21', '172.16.210.22']
Fetched IP: 172.16.210.22
Running `rsync --rsh ssh -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/564ad1a02f/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore script.py ray@172.16.210.22:~/script.py`
Warning: Permanently added '172.16.210.22' (ECDSA) to the list of known hosts.
sending incremental file list

sent 58 bytes  received 12 bytes  140.00 bytes/sec
total size is 627  speedup is 8.96
`rsync`ed script.py (local) to ~/script.py (remote)
Fetched IP: 172.16.210.22
Running `/usr/local/python3/bin/python3.7 ~/script.py`
This cluster consists of
    2 nodes in total
    8.0 CPU resources in total

(f pid=19102)
Tasks executed
    9248 tasks on 172.16.210.22
    752 tasks on 172.16.210.21
Shared connection to 172.16.210.22 closed.

Note the line:
Running /usr/local/python3/bin/python3.7 ~/script.py

About 2:
I have no idea either. Maybe it is because of the code that picks the network interface: the IP I set is bound to the interface “ztbtovnmsg”, which is not in the allowed list, but I don’t know the details of Ray’s internals.
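
To illustrate, a minimal sketch (assuming psutil is installed; this is not the code Ray uses) for checking which interface a configured IP is actually bound to:

from typing import Optional

import psutil


def find_interface_for_ip(ip: str) -> Optional[str]:
    """Return the name of the network interface that carries the given IP,
    or None if no interface has it bound."""
    for ifname, addrs in psutil.net_if_addrs().items():
        for addr in addrs:
            if addr.address == ip:
                return ifname
    return None


if __name__ == "__main__":
    # On the worker in question this returns "ztbtovnmsg" rather than "eth0".
    print(find_interface_for_ip("172.28.1.12"))

If the problem is indeed the interface lookup, passing the IP explicitly with ray start --node-ip-address=<worker ip> in worker_start_ray_commands might also be worth a try.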