Hi. I am following the "Deploying on YARN" page in the Ray 2.9.3 docs to run Ray on top of YARN using Skein. The cluster and the tasks run fine with 1-2 workers, but as soon as I add more (even 3 or 5) it always fails, and I am struggling to understand why. I am using ray 2.9.3 (latest). I will share the logs I have below, but feel free to ask me to run more commands or share more info if it would help.
The error looks like this (more details in the logs below):
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 3
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x4b6da4) [0x55e245810da4] ray::rpc::GcsRpcClient::GetInternalConfig()::{lambda()#2}::operator()()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x45cd05) [0x55e2457b6d05] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x273e15) [0x55e2455cde15] std::_Function_handler<>::_M_invoke()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5b40fe) [0x55e24590e0fe] EventTracker::RecordExecution()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad4ee) [0x55e2459074ee] std::_Function_handler<>::_M_invoke()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad966) [0x55e245907966] boost::asio::detail::completion_handler<>::do_complete()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc6519b) [0x55e245fbf19b] boost::asio::detail::scheduler::do_run_one()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67729) [0x55e245fc1729] boost::asio::detail::scheduler::run()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67c42) [0x55e245fc1c42] boost::asio::io_context::run()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1cfafa) [0x55e245529afa] main
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fd64f610555] __libc_start_main
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2249c7) [0x55e24557e9c7]
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: *** SIGABRT received at time=1710774374 on cpu 54 ***
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: PC: @ 0x7fd64f624387 (unknown) raise
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd6503f3630 (unknown) (unknown)
[2024-03-18 15:06:14,408 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd64fc30a06 (unknown) (unknown)
[2024-03-18 15:06:14,410 E 55847 55847] (raylet) logging.cc:361: @ 0x55e24617bc80 161800592 (unknown)
[2024-03-18 15:06:14,411 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd64fc31fb0 (unknown) (unknown)
[2024-03-18 15:06:14,411 E 55847 55847] (raylet) logging.cc:361: @ 0x3de907894810c083 (unknown) (unknown)
Skein config YAML:

name: ray_testing
queue: default
tags: ["skein", "ray"]
acls:
  enable: true
  view_users: ["myusername", "dr.who"]
  modify_users: ["myusername"]
  ui_users: ["myusername", "dr.who"]
master:
  log_level: debug
services:
  # Head service.
  ray-head:
    # There should only be one instance of the head node per cluster.
    instances: 1
    resources:
      # The resources for the head node.
      vcores: 1
      memory: 8 GiB
    # files: # we use NFS
    script: |
      echo "FER: ray master started date: `date` from host: `hostname`. h: $(hostname -i)"
      source /myuserpath/pyenvs/ray_nfs_nfs_next/bin/activate
      # This stores the Ray head address in the Skein key-value store so that the workers can retrieve it later.
      export RAY_WORKER_PORT=6379  # used by the python script too
      skein kv put current --key=RAY_HEAD_ADDRESS --value=$(hostname -i)
      skein kv put current --key=RAY_WORKER_PORT --value=${RAY_WORKER_PORT}
      # This command starts all the processes needed on the ray head node.
      # By default, we set object store memory and heap memory to roughly 200 MB. This is conservative
      # and should be set according to application needs.
      ray start --head --port=$RAY_WORKER_PORT --object-store-memory=200000000 --memory 2000000000 --num-cpus=1 --include-dashboard=False
      # Run the Ray user script (uses $RAY_WORKER_PORT).
      python /myuserpath/workspace/rayws_nfs/vanilla_ray.py
      # After the user script has executed, all started processes should also die.
      sleep 2000
      ray stop
      skein application shutdown current
  # Worker service.
  ray-worker:
    # The number of instances to start initially. This can be scaled dynamically later.
    instances: 5
    resources:
      # The resources for the worker node.
      vcores: 1
      memory: 4 GiB
    # files: # we use NFS
    # environment: environment.tar.gz
    depends:
      - ray-head  # Don't start any worker nodes until the head node is started.
    script: |
      echo "FER: ray worker started date: `date` from host: `hostname`. h: $(hostname -i)"
      source /myuserpath/pyenvs/ray_nfs_nfs_next/bin/activate
      # This command gets the head node address from the skein key-value store.
      echo "FER: ray worker wait for RAY_HEAD_ADDRESS"
      sleep 5  # margin so that RAY_HEAD_ADDRESS is defined
      RAY_HEAD_ADDRESS=$(skein kv get --key=RAY_HEAD_ADDRESS current)
      RAY_WORKER_PORT=$(skein kv get --key=RAY_WORKER_PORT current)
      echo "FER: ray worker RAY_HEAD_ADDRESS: $RAY_HEAD_ADDRESS PORT: $RAY_WORKER_PORT"
      # The below command starts all the processes needed on a ray worker node, blocking until killed with SIGTERM.
      # After SIGTERM, all started processes should also die (ray stop).
      ray start --object-store-memory=200000000 --memory 200000000 --num-cpus=1 --address=$RAY_HEAD_ADDRESS:$RAY_WORKER_PORT --block; ray stop
      echo "FER: ray worker finished date: `date` from host: `hostname`. RAY_HEAD_ADDRESS: $RAY_HEAD_ADDRESS PORT: $RAY_WORKER_PORT"
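(Aside on the config itself: the fixed `sleep 5` before reading the head address could instead poll the KV store. Below is a hypothetical variant of the first lines of the ray-worker script; it assumes `skein kv get` exits non-zero while the key is unset, and the 60-attempt cap is an arbitrary choice of mine. I don't think this is the cause of my failure, since the workers do print the correct address.)

```yaml
    # Hypothetical sketch: poll the Skein KV store for the head address
    # instead of a fixed sleep. Assumes `skein kv get` fails (non-zero
    # exit) while the key is unset; retry cap is arbitrary.
    script: |
      source /myuserpath/pyenvs/ray_nfs_nfs_next/bin/activate
      for i in $(seq 1 60); do
        RAY_HEAD_ADDRESS=$(skein kv get --key=RAY_HEAD_ADDRESS current) && break
        sleep 1
      done
      RAY_WORKER_PORT=$(skein kv get --key=RAY_WORKER_PORT current)
```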
And these are the logs (four different log sections/files):
Logs in the Skein Application Master container:
24/03/18 15:05:56 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
24/03/18 15:05:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/03/18 15:05:57 INFO skein.ApplicationMaster: Running as user myusername
24/03/18 15:05:57 INFO conf.Configuration: resource-types.xml not found
24/03/18 15:05:57 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
24/03/18 15:05:57 INFO skein.ApplicationMaster: Application specification successfully loaded
24/03/18 15:05:58 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at machine_id_168.internaldomain.com/internal_IP.210:8030
24/03/18 15:05:58 INFO skein.ApplicationMaster: gRPC server started at machine_id_146.internaldomain.com:42247
24/03/18 15:05:58 DEBUG skein.WebUI: Serving resources from jar:file:/var/opt/hadoop/local-dirs/usercache/myusername/appcache/application_1709394358549_17779/filecache/10/skein.jar!/META-INF/resources/
24/03/18 15:05:58 INFO skein.WebUI: UI ACLs are enabled, restricted to 2 users
24/03/18 15:05:58 INFO skein.ApplicationMaster: WebUI server started at machine_id_146.internaldomain.com:44962
24/03/18 15:05:58 INFO skein.ApplicationMaster: Registering application with resource manager
24/03/18 15:05:58 DEBUG skein.ApplicationMaster: Determining resources available for application master
24/03/18 15:05:58 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at machine_id_168.internaldomain.com/internal_IP.210:8032
24/03/18 15:05:59 INFO skein.ApplicationMaster: Initializing service 'ray-head'.
24/03/18 15:05:59 INFO skein.ApplicationMaster: REQUESTED: ray-head_0
24/03/18 15:05:59 INFO skein.ApplicationMaster: Initializing service 'ray-worker'.
24/03/18 15:05:59 INFO skein.ApplicationMaster: WAITING: ray-worker_0
24/03/18 15:05:59 INFO skein.ApplicationMaster: WAITING: ray-worker_1
24/03/18 15:05:59 INFO skein.ApplicationMaster: WAITING: ray-worker_2
24/03/18 15:05:59 INFO skein.ApplicationMaster: WAITING: ray-worker_3
24/03/18 15:05:59 INFO skein.ApplicationMaster: WAITING: ray-worker_4
24/03/18 15:05:59 DEBUG skein.ApplicationMaster: Starting allocator thread
24/03/18 15:05:59 DEBUG skein.ApplicationMaster: Heartbeat intervals [idle: 5000 ms, pending: 1000 ms]
24/03/18 15:06:00 DEBUG skein.ApplicationMaster: Received 1 new containers
24/03/18 15:06:00 INFO skein.ApplicationMaster: Starting container_e44_1709394358549_17779_01_000002...
24/03/18 15:06:00 INFO skein.ApplicationMaster: RUNNING: ray-head_0 on container_e44_1709394358549_17779_01_000002
24/03/18 15:06:00 INFO skein.ApplicationMaster: REQUESTED: ray-worker_0
24/03/18 15:06:00 INFO skein.ApplicationMaster: REQUESTED: ray-worker_1
24/03/18 15:06:00 INFO skein.ApplicationMaster: REQUESTED: ray-worker_2
24/03/18 15:06:00 INFO skein.ApplicationMaster: REQUESTED: ray-worker_3
24/03/18 15:06:00 INFO skein.ApplicationMaster: REQUESTED: ray-worker_4
24/03/18 15:06:02 DEBUG skein.ApplicationMaster: Received 5 new containers
24/03/18 15:06:02 INFO skein.ApplicationMaster: Starting container_e44_1709394358549_17779_01_000003...
24/03/18 15:06:02 INFO skein.ApplicationMaster: RUNNING: ray-worker_0 on container_e44_1709394358549_17779_01_000003
24/03/18 15:06:02 INFO skein.ApplicationMaster: Starting container_e44_1709394358549_17779_01_000004...
24/03/18 15:06:02 INFO skein.ApplicationMaster: RUNNING: ray-worker_1 on container_e44_1709394358549_17779_01_000004
24/03/18 15:06:02 INFO skein.ApplicationMaster: Starting container_e44_1709394358549_17779_01_000005...
24/03/18 15:06:02 INFO skein.ApplicationMaster: RUNNING: ray-worker_2 on container_e44_1709394358549_17779_01_000005
24/03/18 15:06:02 INFO skein.ApplicationMaster: Starting container_e44_1709394358549_17779_01_000006...
24/03/18 15:06:02 INFO skein.ApplicationMaster: RUNNING: ray-worker_3 on container_e44_1709394358549_17779_01_000006
24/03/18 15:06:02 INFO skein.ApplicationMaster: Starting container_e44_1709394358549_17779_01_000007...
24/03/18 15:06:02 INFO skein.ApplicationMaster: RUNNING: ray-worker_4 on container_e44_1709394358549_17779_01_000007
24/03/18 15:06:22 DEBUG skein.ApplicationMaster: Received 5 completed containers
24/03/18 15:06:22 INFO skein.ApplicationMaster: SUCCEEDED: ray-worker_4 - Completed successfully.
24/03/18 15:06:22 INFO skein.ApplicationMaster: SUCCEEDED: ray-worker_2 - Completed successfully.
24/03/18 15:06:22 INFO skein.ApplicationMaster: SUCCEEDED: ray-worker_0 - Completed successfully.
24/03/18 15:06:22 INFO skein.ApplicationMaster: SUCCEEDED: ray-worker_3 - Completed successfully.
24/03/18 15:06:22 INFO skein.ApplicationMaster: SUCCEEDED: ray-worker_1 - Completed successfully.
Logs in ray-head.log:
FER: ray master started date: Mon 18 Mar 15:06:02 GMT 2024 from host: machine_id_151.internaldomain.com. h: internal_IP.6
2024-03-18 15:06:08,040 INFO usage_lib.py:454 -- Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-18 15:06:08,041 INFO scripts.py:744 -- Local node IP: internal_IP.6
2024-03-18 15:06:11,477 SUCC scripts.py:781 -- --------------------
2024-03-18 15:06:11,477 SUCC scripts.py:782 -- Ray runtime started.
2024-03-18 15:06:11,477 SUCC scripts.py:783 -- --------------------
2024-03-18 15:06:11,478 INFO scripts.py:785 -- Next steps
2024-03-18 15:06:11,478 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-03-18 15:06:11,478 INFO scripts.py:791 --   ray start --address='internal_IP.6:6379'
2024-03-18 15:06:11,478 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-03-18 15:06:11,478 INFO scripts.py:802 -- import ray
2024-03-18 15:06:11,479 INFO scripts.py:803 -- ray.init()
2024-03-18 15:06:11,479 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-03-18 15:06:11,479 INFO scripts.py:835 --   ray stop
2024-03-18 15:06:11,479 INFO scripts.py:838 -- To view the status of the cluster, use
2024-03-18 15:06:11,479 INFO scripts.py:839 -- ray status
before main stdout
logging timestamp: 18-03-2024__15-06-12
not local port: 6379
2024-03-18 15:06:12,193 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: internal_IP.6:6379...
2024-03-18 15:06:12,201 INFO worker.py:1724 -- Connected to Ray cluster.
1 nodes have joined so far (stdout), waiting for 6 more. seconds: 0
1 nodes have joined so far (stdout), waiting for 6 more. seconds: 1
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 2
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 3
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x4b6da4) [0x55e245810da4] ray::rpc::GcsRpcClient::GetInternalConfig()::{lambda()#2}::operator()()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x45cd05) [0x55e2457b6d05] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x273e15) [0x55e2455cde15] std::_Function_handler<>::_M_invoke()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5b40fe) [0x55e24590e0fe] EventTracker::RecordExecution()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad4ee) [0x55e2459074ee] std::_Function_handler<>::_M_invoke()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad966) [0x55e245907966] boost::asio::detail::completion_handler<>::do_complete()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc6519b) [0x55e245fbf19b] boost::asio::detail::scheduler::do_run_one()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67729) [0x55e245fc1729] boost::asio::detail::scheduler::run()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67c42) [0x55e245fc1c42] boost::asio::io_context::run()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1cfafa) [0x55e245529afa] main
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fd64f610555] __libc_start_main
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2249c7) [0x55e24557e9c7]
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: *** SIGABRT received at time=1710774374 on cpu 54 ***
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: PC: @ 0x7fd64f624387 (unknown) raise
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd6503f3630 (unknown) (unknown)
[2024-03-18 15:06:14,408 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd64fc30a06 (unknown) (unknown)
[2024-03-18 15:06:14,410 E 55847 55847] (raylet) logging.cc:361: @ 0x55e24617bc80 161800592 (unknown)
[2024-03-18 15:06:14,411 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd64fc31fb0 (unknown) (unknown)
[2024-03-18 15:06:14,411 E 55847 55847] (raylet) logging.cc:361: @ 0x3de907894810c083 (unknown) (unknown)
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 4
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 5
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 6
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 7
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 8
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 9
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 10
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 11
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 12
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 13
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 14
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 15
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 16
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 17
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 18
(raylet) The node with node id: ac7e58da8c8dd9ec777f827f6cf49986d52213273fdc57206cbc5925 and address: internal_IP.50 and node name: internal_IP.50 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x4b6da4) [0x55e245810da4] ray::rpc::GcsRpcClient::GetInternalConfig()::{lambda()#2}::operator()()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x45cd05) [0x55e2457b6d05] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x273e15) [0x55e2455cde15] std::_Function_handler<>::_M_invoke()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5b40fe) [0x55e24590e0fe] EventTracker::RecordExecution()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad4ee) [0x55e2459074ee] std::_Function_handler<>::_M_invoke()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad966) [0x55e245907966] boost::asio::detail::completion_handler<>::do_complete()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc6519b) [0x55e245fbf19b] boost::asio::detail::scheduler::do_run_one()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67729) [0x55e245fc1729] boost::asio::detail::scheduler::run()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67c42) [0x55e245fc1c42] boost::asio::io_context::run()
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1cfafa) [0x55e245529afa] main
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fd64f610555] __libc_start_main
/my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2249c7) [0x55e24557e9c7]
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: *** SIGABRT received at time=1710774374 on cpu 54 ***
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: PC: @ 0x7fd64f624387 (unknown) raise
[2024-03-18 15:06:14,407 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd6503f3630 (unknown) (unknown)
[2024-03-18 15:06:14,408 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd64fc30a06 (unknown) (unknown)
[2024-03-18 15:06:14,410 E 55847 55847] (raylet) logging.cc:361: @ 0x55e24617bc80 161800592 (unknown)
[2024-03-18 15:06:14,411 E 55847 55847] (raylet) logging.cc:361: @ 0x7fd64fc31fb0 (unknown) (unknown)
[2024-03-18 15:06:14,411 E 55847 55847] (raylet) logging.cc:361: @ 0x3de907894810c083 (unknown) (unknown)
[repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 19
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 20
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 21
2 nodes have joined so far (stdout), waiting for 5 more. seconds: 22
And so on; it never gets more connected nodes, even though I configured 5 worker nodes.
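For context, vanilla_ray.py is not doing anything exotic: the "N nodes have joined" lines come from a wait loop roughly like the sketch below. This is an illustrative reconstruction, not the real script; `count_alive` and `wait_for_nodes` are names I made up here, with the `ray.nodes()` call factored out as a `get_nodes` callable so the counting logic is plain Python.

```python
import time

def count_alive(nodes):
    # Count entries shaped like ray.nodes() dicts that are marked alive.
    return sum(1 for n in nodes if n.get("Alive"))

def wait_for_nodes(get_nodes, expected, timeout_s=600, poll_s=1.0):
    # Poll until `expected` nodes are alive or the timeout expires.
    # Returns True on success, False on timeout.
    start = time.monotonic()
    elapsed = 0
    while True:
        alive = count_alive(get_nodes())
        if alive >= expected:
            return True
        if time.monotonic() - start >= timeout_s:
            return False
        print(f"{alive} nodes have joined so far (stdout), "
              f"waiting for {expected - alive} more. seconds: {elapsed}")
        time.sleep(poll_s)
        elapsed += 1

# In the real script this would be something like
# wait_for_nodes(ray.nodes, expected=...) after ray.init(); the exact
# expected node count used by my script is not shown here.
```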
Logs from the workers ray-worker_0 to ray-worker_3 (they all look the same):
FER: ray worker started date: Mon 18 Mar 15:06:04 GMT 2024 from host: machine_id_194. h: internal_IP.50
FER: ray worker wait for RAY_HEAD_ADDRESS
FER: ray worker RAY_HEAD_ADDRESS: internal_IP.6 PORT: 6379
[2024-03-18 15:06:14,311 I 55679 55679] global_state_accessor.cc:432: This node has an IP address of internal_IP.50, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
2024-03-18 15:06:13,963 INFO scripts.py:926 -- Local node IP: internal_IP.50
2024-03-18 15:06:14,313 SUCC scripts.py:939 -- --------------------
2024-03-18 15:06:14,313 SUCC scripts.py:940 -- Ray runtime started.
2024-03-18 15:06:14,313 SUCC scripts.py:941 -- --------------------
2024-03-18 15:06:14,313 INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-03-18 15:06:14,313 INFO scripts.py:944 --   ray stop
2024-03-18 15:06:14,313 INFO scripts.py:952 -- --block
2024-03-18 15:06:14,313 INFO scripts.py:953 -- This command will now block forever until terminated by a signal.
2024-03-18 15:06:14,313 INFO scripts.py:956 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2024-03-18 15:06:20,589 INFO scripts.py:1163 -- Did not find any active Ray processes.
FER: ray worker finished date: Mon 18 Mar 15:06:20 GMT 2024 from host: machine_id_194. RAY_HEAD_ADDRESS: internal_IP.6 PORT: 6379
ray-worker_4's logs are a bit different: it tries to stop a log-monitor process, which seems suspicious:
FER: ray worker started date: Mon 18 Mar 15:06:04 GMT 2024 from host: machine_id_194. h: internal_IP.50
FER: ray worker wait for RAY_HEAD_ADDRESS
FER: ray worker RAY_HEAD_ADDRESS: internal_IP.6 PORT: 6379
[2024-03-18 15:06:14,311 I 55680 55680] global_state_accessor.cc:432: This node has an IP address of internal_IP.50, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
2024-03-18 15:06:13,957 INFO scripts.py:926 -- Local node IP: internal_IP.50
2024-03-18 15:06:14,312 SUCC scripts.py:939 -- --------------------
2024-03-18 15:06:14,313 SUCC scripts.py:940 -- Ray runtime started.
2024-03-18 15:06:14,313 SUCC scripts.py:941 -- --------------------
2024-03-18 15:06:14,313 INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-03-18 15:06:14,313 INFO scripts.py:944 --   ray stop
2024-03-18 15:06:14,313 INFO scripts.py:952 -- --block
2024-03-18 15:06:14,313 INFO scripts.py:953 -- This command will now block forever until terminated by a signal.
2024-03-18 15:06:14,313 INFO scripts.py:956 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2024-03-18 15:06:19,418 VINFO scripts.py:1099 -- Attempted to stop `/my_user_path/pyenvs/ray_nfs_nfs_next/bin/python -u /my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/_private/log_monitor.py --session-dir=/tmp/ray/session_2024-03-18_15-06-08_041699_48493 --logs-dir=/tmp/ray/session_2024-03-18_15-06-08_041699_48493/logs --gcs-address=internal_IP.6:6379 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5`, but process was already dead.
2024-03-18 15:06:19,419 VINFO scripts.py:1099 -- Attempted to stop `/my_user_path/pyenvs/ray_nfs_nfs_next/bin/python -u /my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/_private/log_monitor.py --session-dir=/tmp/ray/session_2024-03-18_15-06-08_041699_48493 --logs-dir=/tmp/ray/session_2024-03-18_15-06-08_041699_48493/logs --gcs-address=internal_IP.6:6379 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5`, but process was already dead.
2024-03-18 15:06:19,496 VINFO scripts.py:1099 -- Attempted to stop `/my_user_path/pyenvs/ray_nfs_nfs_next/bin/python -u /my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/_private/log_monitor.py --session-dir=/tmp/ray/session_2024-03-18_15-06-08_041699_48493 --logs-dir=/tmp/ray/session_2024-03-18_15-06-08_041699_48493/logs --gcs-address=internal_IP.6:6379 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5`, but process was already dead.
2024-03-18 15:06:19,497 VINFO scripts.py:1099 -- Attempted to stop `/my_user_path/pyenvs/ray_nfs_nfs_next/bin/python -u /my_user_path/pyenvs/ray_nfs_nfs_next/lib/python3.8/site-packages/ray/_private/log_monitor.py --session-dir=/tmp/ray/session_2024-03-18_15-06-08_041699_48493 --logs-dir=/tmp/ray/session_2024-03-18_15-06-08_041699_48493/logs --gcs-address=internal_IP.6:6379 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5`, but process was already dead.
2024-03-18 15:06:19,818 INFO scripts.py:1163 -- Did not find any active Ray processes.
FER: ray worker finished date: Mon 18 Mar 15:06:20 GMT 2024 from host: machine_id_194. RAY_HEAD_ADDRESS: internal_IP.6 PORT: 6379