Ray Client remote does not work

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I have this setup in docker-compose:

services:
  ray-head:
    image: rayproject/ray-ml:nightly-py310-cpu
    container_name: ray-head
    env_file:
      - .env
    command: >
      ray start
      --head
      --dashboard-port=${DASHBOARDPORT} 
      --dashboard-host=0.0.0.0 
      --redis-password=${REDISPASSWORD}
      --block
    ports:
      - "6379:${REDISPORT}"
      - "8265:${DASHBOARDPORT}"
      - "10001:${HEADNODEPORT}"
  ray-worker:
    image: rayproject/ray-ml:nightly-py310-cpu
    env_file:
      - .env
    depends_on:
      - ray-head
    command: >
      ray start
      --address=ray-head:${REDISPORT} 
      --redis-password=${REDISPASSWORD}
      --block

Once the dashboard is ready, in my local host environment, I run:

ray.init(address="ray://localhost:10001")

It fails to connect.
The ray_client_server.err log shows:

<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
2024-08-22 06:05:30,417 INFO server.py:886 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='192.168.224.2:6379', redis_password='yourpassword', runtime_env_agent_address='http://192.168.224.2:55993')
2024-08-22 06:05:50,020 INFO proxier.py:696 -- New data connection from client d2ed476998104d23a93db6bff05c2d5f:
2024-08-22 06:06:33,149 ERROR proxier.py:333 -- SpecificServer startup failed for client: d2ed476998104d23a93db6bff05c2d5f
2024-08-22 06:06:33,150 INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 384 for client: d2ed476998104d23a93db6bff05c2d5f
2024-08-22 06:06:33,152 ERROR proxier.py:707 -- Server startup failed for client: d2ed476998104d23a93db6bff05c2d5f, using JobConfig: <ray.job_config.JobConfig object at 0x4007e75f30>!
2024-08-22 06:07:01,075 INFO proxier.py:391 -- Specific server d2ed476998104d23a93db6bff05c2d5f is no longer running, freeing its port 23000
2024-08-22 06:07:03,170 ERROR proxier.py:380 -- Timeout waiting for channel for d2ed476998104d23a93db6bff05c2d5f
Traceback (most recent call last):
 File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 375, in get_channel
 grpc.channel_ready_future(server.channel).result(
 File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 162, in result
 self._block(timeout)
 File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 106, in _block
 raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2024-08-22 06:07:03,176 WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
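
One way to read the log above is to measure how long the per-client server took before the proxier gave up. The sketch below uses only the two timestamps already present in the log. The helper names (`parse_log_time`, `seconds_between`) are made up for illustration and are not part of Ray. Reading the delay as slow startup under emulation is only a guess, prompted by the jemalloc lines that mention QEMU.

```python
from datetime import datetime

# Timestamp layout used by the Ray log lines above, e.g. "2024-08-22 06:05:50,020".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line: str) -> datetime:
    """Parse the leading 23-character timestamp of a log line."""
    return datetime.strptime(line[:23], LOG_TIME_FORMAT)


def seconds_between(first_line: str, second_line: str) -> float:
    """Return the elapsed seconds from first_line to second_line."""
    delta = parse_log_time(second_line) - parse_log_time(first_line)
    return delta.total_seconds()


# Two lines copied from the log above.
connected = (
    "2024-08-22 06:05:50,020 INFO proxier.py:696 -- "
    "New data connection from client d2ed476998104d23a93db6bff05c2d5f:"
)
failed = (
    "2024-08-22 06:06:33,149 ERROR proxier.py:333 -- "
    "SpecificServer startup failed for client: d2ed476998104d23a93db6bff05c2d5f"
)

startup_delay = seconds_between(connected, failed)
print(f"SpecificServer startup was reported as failed after {startup_delay:.3f} s")
```

In this log, about 43 seconds pass between the new data connection and the startup failure.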

If I don’t use the prebuilt image and instead build my own image from this Dockerfile:

FROM python:3.11.0-slim

RUN pip install --no-cache-dir ray[default]

I can connect and submit jobs.

Why is that?
Once I deploy this setup to AWS ECS, if I cannot submit jobs to the cluster remotely, all service components would need to run inside the head node. Wouldn’t that be very inefficient?
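
One visible difference between the two images is the Python version: the prebuilt tag is py310, while the Dockerfile starts from python:3.11.0. Ray Client generally expects the client and the cluster to run matching Python minor versions and matching Ray versions, so it may be worth comparing them. This is a minimal sketch, and whether a mismatch is the cause here is an assumption, not something the logs confirm. `local_versions` and `same_minor_python` are hypothetical helper names.

```python
import sys
from importlib import metadata


def local_versions() -> dict:
    """Report the Python minor version and the installed Ray version (None if Ray is absent)."""
    try:
        ray_version = metadata.version("ray")
    except metadata.PackageNotFoundError:
        ray_version = None
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return {"python": python_version, "ray": ray_version}


def same_minor_python(a: str, b: str) -> bool:
    """Compare two version strings on their major.minor components only."""
    return a.split(".")[:2] == b.split(".")[:2]


# Run this on the client machine and inside the head container, then compare.
print(local_versions())

# The two images from this thread: py310 tag versus the python:3.11.0 base image.
print(same_minor_python("3.10", "3.11.0"))
```

Running the same snippet with `docker exec` in the ray-head container and on the machine that calls `ray.init` gives two dictionaries to compare.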

Maybe my understanding of Ray Client is not correct. If I use my own custom image built with pip install ray[default],
I get these logs in ray_client_server.err:

2024-08-22 14:05:27,672 INFO server.py:886 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='172.18.0.2:6379', redis_password='yourpassword', runtime_env_agent_address='http://172.18.0.2:64784')
2024-08-22 14:05:29,186 ERROR proxier.py:351 -- Unable to find channel for client: 5e030aaf93524b4dac939266ea8bedb3
2024-08-22 14:05:29,186 WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
2024-08-22 14:05:31,187 ERROR proxier.py:351 -- Unable to find channel for client: 5e030aaf93524b4dac939266ea8bedb3
2024-08-22 14:05:31,188 WARNING proxier.py:804 -- Retrying Logstream connection. 2 attempts failed.
2024-08-22 14:05:33,188 ERROR proxier.py:351 -- Unable to find channel for client: 5e030aaf93524b4dac939266ea8bedb3
2024-08-22 14:05:33,189 WARNING proxier.py:804 -- Retrying Logstream connection. 3 attempts failed.
2024-08-22 14:05:35,190 ERROR proxier.py:351 -- Unable to find channel for client: 5e030aaf93524b4dac939266ea8bedb3
2024-08-22 14:05:35,190 WARNING proxier.py:804 -- Retrying Logstream connection. 4 attempts failed.
2024-08-22 14:05:37,196 ERROR proxier.py:351 -- Unable to find channel for client: 5e030aaf93524b4dac939266ea8bedb3
2024-08-22 14:05:37,197 WARNING proxier.py:804 -- Retrying Logstream connection. 5 attempts failed.
2024-08-22 14:06:36,595 INFO proxier.py:696 -- New data connection from client 8d53721b5d8a49388bc5b86e9333291c:
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1724335596.608485 178 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
2024-08-22 14:06:37,121 INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 328 for client: 8d53721b5d8a49388bc5b86e9333291c
2024-08-22 14:08:53,061 INFO proxier.py:696 -- New data connection from client aa023df182184ee9a27f7a914243d5a4:
I0000 00:00:1724335733.072788 402 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
2024-08-22 14:08:53,590 INFO proxier.py:341 -- SpecificServer started on port: 23001 with PID: 450 for client: aa023df182184ee9a27f7a914243d5a4

As you can see, when the Ray cluster launches, the line Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001 is followed by repeated Unable to find channel errors.
After that, all subsequent ray.init(address='ray://localhost:10001') calls were able to send jobs to the cluster.
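
Since the first attempts fail and later ones succeed, the client side could retry the connection a few times. The sketch below is a generic helper and not a Ray API. `connect_with_retry` is a hypothetical name, and it assumes a failed `ray.init` against a `ray://` address surfaces as `ConnectionError`.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T], attempts: int = 5, delay_s: float = 2.0) -> T:
    """Call connect() until it succeeds or the attempts are used up."""
    last_error: Optional[BaseException] = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:  # assumption: connection failures raise this
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


# Intended usage with Ray (left as a comment because it needs a running cluster):
#   import ray
#   connect_with_retry(lambda: ray.init(address="ray://localhost:10001"))
```

The connection call is passed in as a callable so the helper can be exercised without a cluster.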

I thought that the cluster launches an internal Client, and that we can then use a CLI client remotely to execute jobs through that internal Client.

What do you mean by “internal Client”?

Hi Sam,
Using the image rayproject/ray-ml:nightly-py310-cpu, I noticed this in the ray_client_server.err log:

2024-08-22 06:05:30,417 INFO server.py:886 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='192.168.224.2:6379', redis_password='yourpassword', runtime_env_agent_address='http://192.168.224.2:55993')

Then any subsequent ray.init to port 10001 fails, and the connection is redirected to port 23000.
That’s why I assumed that the Ray head launched an internal client that occupied the port.
I don’t get that issue with the custom-built image.
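
As far as the logs show, the server on port 10001 runs with mode='proxy' and starts one SpecificServer per client on ports from 23000 upward, so seeing port 23000 does not by itself mean port 10001 is occupied. A small sketch for listing which client was given which port, using two lines from the custom-image log above. The regex and the `specific_servers` name are illustrative only and not part of Ray.

```python
import re

# Matches lines such as:
# "... SpecificServer started on port: 23000 with PID: 328 for client: 8d53..."
SPECIFIC_SERVER_RE = re.compile(
    r"SpecificServer started on port: (\d+) with PID: (\d+) for client: ([0-9a-f]+)"
)


def specific_servers(log_text: str) -> dict:
    """Map each client id to the port of the SpecificServer started for it."""
    return {m.group(3): int(m.group(1)) for m in SPECIFIC_SERVER_RE.finditer(log_text)}


# Two lines copied from the custom-image log above.
sample_log = """\
2024-08-22 14:06:37,121 INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 328 for client: 8d53721b5d8a49388bc5b86e9333291c
2024-08-22 14:08:53,590 INFO proxier.py:341 -- SpecificServer started on port: 23001 with PID: 450 for client: aa023df182184ee9a27f7a914243d5a4
"""

ports_by_client = specific_servers(sample_log)
for client_id, port in ports_by_client.items():
    print(client_id, port)
```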

Hmm, so are you running ray.init from your laptop to connect to the Ray Cluster and then having other code also try to run ray.init against the same Cluster?

Sorry for the late response.
I ran everything in a local docker-compose setup.

I used the exact config from the docker-compose.yml posted above,
and had another service (on the same network) connect to the cluster locally.

Then I encountered this error:

2024-08-22 06:06:33,149 ERROR proxier.py:333 -- SpecificServer startup failed for client: d2ed476998104d23a93db6bff05c2d5f
2024-08-22 06:06:33,150 INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 384 for client: d2ed476998104d23a93db6bff05c2d5f

If I use my own custom image with pip install ray[default] in a similar docker-compose setup,
I am able to run ray.init(address="ray://localhost:10001") without failure.

I’m still not quite understanding your setup; so you’re deploying a Ray Cluster onto AWS ECS and then also initiating a Ray Client connection from a piece of Compute on the Cloud to said Cluster?