How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I have this setup in docker-compose:

```yaml
services:
  ray-head:
    image: rayproject/ray-ml:nightly-py310-cpu
    container_name: ray-head
    env_file:
      - .env
    command: >
      ray start
      --head
      --dashboard-port=${DASHBOARDPORT}
      --dashboard-host=0.0.0.0
      --redis-password=${REDISPASSWORD}
      --block
    ports:
      - "6379:${REDISPORT}"
      - "8265:${DASHBOARDPORT}"
      - "10001:${HEADNODEPORT}"
  ray-worker:
    image: rayproject/ray-ml:nightly-py310-cpu
    env_file:
      - .env
    depends_on:
      - ray-head
    command: >
      ray start
      --address=ray-head:${REDISPORT}
      --redis-password=${REDISPASSWORD}
      --block
```
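For reference, the compose file reads its variables from an `.env` file along these lines (the variable names come from the compose file above; the values are my guesses inferred from the host-side port mappings and the log output, not confirmed):

```
# Hypothetical .env — values inferred from the published ports above
REDISPORT=6379
DASHBOARDPORT=8265
HEADNODEPORT=10001
REDISPASSWORD=yourpassword
```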
Once the dashboard is ready, I run this from my local host environment:

```python
ray.init(address="ray://localhost:10001")
```

but it fails to connect.
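Before calling `ray.init`, a plain socket check can confirm that the client port itself is reachable (a minimal sketch; `port_open` is my own helper, not a Ray API), which rules out basic port-mapping problems:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. check port_open("localhost", 10001) before ray.init("ray://localhost:10001")
```

In the failing case here the port mapping itself is evidently fine, since the proxier log below records an incoming data connection; the failure happens later, during the client server startup.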
Checking the ray_client_server.err log shows:
```
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
2024-08-22 06:05:30,417 INFO server.py:886 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='192.168.224.2:6379', redis_password='yourpassword', runtime_env_agent_address='http://192.168.224.2:55993')
2024-08-22 06:05:50,020 INFO proxier.py:696 -- New data connection from client d2ed476998104d23a93db6bff05c2d5f:
2024-08-22 06:06:33,149 ERROR proxier.py:333 -- SpecificServer startup failed for client: d2ed476998104d23a93db6bff05c2d5f
2024-08-22 06:06:33,150 INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 384 for client: d2ed476998104d23a93db6bff05c2d5f
2024-08-22 06:06:33,152 ERROR proxier.py:707 -- Server startup failed for client: d2ed476998104d23a93db6bff05c2d5f, using JobConfig: <ray.job_config.JobConfig object at 0x4007e75f30>!
2024-08-22 06:07:01,075 INFO proxier.py:391 -- Specific server d2ed476998104d23a93db6bff05c2d5f is no longer running, freeing its port 23000
2024-08-22 06:07:03,170 ERROR proxier.py:380 -- Timeout waiting for channel for d2ed476998104d23a93db6bff05c2d5f
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 375, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 162, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 106, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2024-08-22 06:07:03,176 WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
```
However, if instead of the pre-built image I build my own from this Dockerfile:

```dockerfile
FROM python:3.11.0-slim
RUN pip install --no-cache-dir "ray[default]"
```

then I can connect and submit jobs.
Why is that?

Once I deploy this setup to AWS ECS, if I cannot submit jobs to the cluster remotely, then all service components will need to run inside the head node, which would be very inefficient.