Can't connect to Ray cluster when passing `runtime_env` to `ray.init`

Hi team,

When I connect to an existing cluster without a runtime_env parameter, it works fine.

But when I pass a runtime_env to ray.init, the call hangs for several minutes and then raises this error:

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 704, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 305, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 281, in _create_runtime_env
    raise TimeoutError(
TimeoutError: GetOrCreateRuntimeEnv request failed after 5 attempts. Last exception: HTTP Error 503: Service Unavailable

The code I'm using is just:

In [1]: import ray

In [2]: ray.init(address="ray://localhost:10001", runtime_env={"pip": ["emoji"]})

The head node log looks like this:

2024-04-15 02:38:37,429	INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(address='240.52.20.253:6379', host='0.0.0.0', mode='proxy', port=10001, redis_password=None, runtime_env_agent_address='http://240.52.20.253:50805')

2024-04-15 02:40:49,943	INFO proxier.py:696 -- New data connection from client a56378b6ab814ecbb066279448514367: 
2024-04-15 02:40:49,952	INFO proxier.py:223 -- Increasing runtime env reference for ray_client_server_23000.Serialized runtime env is {"_ray_commit": "9be5a16e3ccad0710bba08d0f75e9ff774ae6880", "pip": {"packages": ["emoji"], "pip_check": false}}.
2024-04-15 02:41:49,001	WARNING proxier.py:270 -- GetOrCreateRuntimeEnv request failed: HTTP Error 503: Service Unavailable. Retrying after 0.5s. 5 retries remaining.
2024-04-15 02:42:49,001	WARNING proxier.py:270 -- GetOrCreateRuntimeEnv request failed: HTTP Error 503: Service Unavailable. Retrying after 1.0s. 4 retries remaining.
2024-04-15 02:43:50,001	WARNING proxier.py:270 -- GetOrCreateRuntimeEnv request failed: HTTP Error 503: Service Unavailable. Retrying after 2.0s. 3 retries remaining.
2024-04-15 02:44:52,001	WARNING proxier.py:270 -- GetOrCreateRuntimeEnv request failed: HTTP Error 503: Service Unavailable. Retrying after 4.0s. 2 retries remaining.
2024-04-15 02:45:56,002	WARNING proxier.py:270 -- GetOrCreateRuntimeEnv request failed: HTTP Error 503: Service Unavailable. Retrying after 8.0s. 1 retries remaining.
2024-04-15 02:47:04,001	WARNING proxier.py:270 -- GetOrCreateRuntimeEnv request failed: HTTP Error 503: Service Unavailable. Retrying after 16.0s. 0 retries remaining.
2024-04-15 02:47:50,016	INFO proxier.py:768 -- a56378b6ab814ecbb066279448514367 last started stream at 1713174049.9413047. Current stream started at 1713174049.9413047.

I dug into the source code a little and suspect the cause is that I'm using a proxy to connect to the internet. However, I've already added the CIDR 240.52.0.0/16 to both the no_proxy and NO_PROXY environment variables. I'm not sure, but it doesn't seem to take effect.
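One way to test the proxy suspicion is the sketch below. It assumes the Ray Client proxier sends the GetOrCreateRuntimeEnv request through Python's standard urllib, which the "HTTP Error 503" wording suggests but which is not confirmed here. Python's urllib matches no_proxy entries as host names or suffixes and does not interpret CIDR notation, so a CIDR entry would not exempt the agent address from the proxy. The helper name bypasses_proxy is made up for illustration; the host and CIDR values are taken from the log above.

```python
import urllib.request


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if urllib would skip the proxy for `host` given `no_proxy`.

    Hypothetical helper: it passes the no_proxy value explicitly instead of
    reading the real environment, so it is safe to run anywhere.
    """
    # urllib stores the no_proxy setting under the key "no".
    return bool(
        urllib.request.proxy_bypass_environment(host, proxies={"no": no_proxy})
    )


# Address of the runtime env agent, as reported in the head node log.
agent = "240.52.20.253:50805"

# A CIDR entry is compared as a plain string, so it never matches the IP.
print(bypasses_proxy(agent, "240.52.0.0/16"))

# Listing the exact IP does match.
print(bypasses_proxy(agent, "240.52.20.253"))
```

If the first check reports that the proxy is not bypassed, listing the head node IP explicitly in no_proxy (in the environment of the process that runs the Ray Client server on the head node) may be worth trying.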

I also tried the same code through the Job API, and that works:

ray job submit --address="http://localhost:8265" --runtime-env-json='{"pip": ["emoji"]}' -- python test_ray_job.py
Job submission server address: http://localhost:8265

-------------------------------------------------------
Job 'raysubmit_gZuVtxj2Uc8H4q1G' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_gZuVtxj2Uc8H4q1G
  Query the status of the job:
    ray job status raysubmit_gZuVtxj2Uc8H4q1G
  Request the job to be stopped:
    ray job stop raysubmit_gZuVtxj2Uc8H4q1G

Could you help me figure out what I need to do to make this work?

Thank you.

I'm running the Ray cluster on K8s (EKS) using the KubeRay Operator.

The image for the cluster is rayproject/ray:2.9.0

Python version is 3.8.18

Hi,

Did you happen to find a resolution to this? I am running into the exact same error in a proxied environment, and my no_proxy env variable has all the right CIDR ranges.