Hi @lobanov
Thanks for providing a detailed breakdown of steps to try.
On the Ray head, the startup logs below do not suggest any issue.
2024-08-01 10:38:51,011 INFO usage_lib.py:398 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-08-01 10:38:51,011 INFO scripts.py:710 -- Local node IP: 10.244.1.11
2024-08-01 10:38:53,427 SUCC scripts.py:747 -- --------------------
2024-08-01 10:38:53,427 SUCC scripts.py:748 -- Ray runtime started.
2024-08-01 10:38:53,427 SUCC scripts.py:749 -- --------------------
2024-08-01 10:38:53,427 INFO scripts.py:751 -- Next steps
2024-08-01 10:38:53,428 INFO scripts.py:754 -- To add another node to this Ray cluster, run
2024-08-01 10:38:53,428 INFO scripts.py:757 -- ray start --address='10.244.1.11:6379'
2024-08-01 10:38:53,428 INFO scripts.py:766 -- To connect to this Ray cluster:
2024-08-01 10:38:53,428 INFO scripts.py:768 -- import ray
2024-08-01 10:38:53,428 INFO scripts.py:769 -- ray.init()
2024-08-01 10:38:53,428 INFO scripts.py:781 -- To submit a Ray job using the Ray Jobs CLI:
2024-08-01 10:38:53,428 INFO scripts.py:782 -- RAY_ADDRESS='http://10.244.1.11:8265' ray job submit --working-dir . -- python my_script.py
2024-08-01 10:38:53,428 INFO scripts.py:791 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
2024-08-01 10:38:53,428 INFO scripts.py:795 -- for more information on submitting Ray jobs to the Ray cluster.
2024-08-01 10:38:53,428 INFO scripts.py:800 -- To terminate the Ray runtime, run
2024-08-01 10:38:53,428 INFO scripts.py:801 -- ray stop
2024-08-01 10:38:53,428 INFO scripts.py:804 -- To view the status of the cluster, use
2024-08-01 10:38:53,428 INFO scripts.py:805 -- ray status
2024-08-01 10:38:53,428 INFO scripts.py:809 -- To monitor and debug Ray, view the dashboard at
2024-08-01 10:38:53,428 INFO scripts.py:810 -- 10.244.1.11:8265
2024-08-01 10:38:53,428 INFO scripts.py:817 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2024-08-01 10:38:53,428 INFO utils.py:112 -- Overwriting previous Ray address (10.244.1.10:6379). Running ray.init() on this node will now connect to the new instance at 10.244.1.11:6379. To override this behavior, pass address=10.244.1.10:6379 to ray.init().
2024-08-01 10:38:53,428 INFO scripts.py:917 -- --block
2024-08-01 10:38:53,428 INFO scripts.py:918 -- This command will now block forever until terminated by a signal.
2024-08-01 10:38:53,428 INFO scripts.py:921 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
However, on the Ray head, when I ran the following command, python -c "import ray; ray.init()"
the following errors were raised.
2024-08-01 11:01:08,956 INFO worker.py:1314 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2024-08-01 11:01:08,956 INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 10.244.1.11:6379...
2024-08-01 11:01:08,966 INFO worker.py:1616 -- Connected to Ray cluster. View the dashboard at 10.244.1.11:8265
2024-08-01 11:01:08,980 WARNING worker.py:1964 -- The autoscaler failed with the following error:
Traceback (most recent call last):
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib64/python3.9/socket.py", line 954, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 715, in urlopen
httplib_response = self._make_request(
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 404, in _make_request
self._validate_conn(conn)
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
conn.connect()
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connection.py", line 363, in connect
self.sock = conn = self._new_conn()
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x75886ad7c580>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 799, in urlopen
retries = retries.increment(
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/default/rayclusters/raycluster-kuberay (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x75886ad7c580>: Failed to establish a new connection: [Errno -2] Name or service not known'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/monitor.py", line 549, in run
self._initialize_autoscaler()
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/monitor.py", line 233, in _initialize_autoscaler
self.autoscaler = StandardAutoscaler(
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 251, in __init__
self.reset(errors_fatal=True)
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1111, in reset
raise e
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1028, in reset
new_config = self.config_reader()
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 59, in __call__
ray_cr = self._fetch_ray_cr_from_k8s_with_retries()
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 71, in _fetch_ray_cr_from_k8s_with_retries
return self._fetch_ray_cr_from_k8s()
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib64/python3.9/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 85, in _fetch_ray_cr_from_k8s
result = requests.get(
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/opt/app-root/src/.cache/pypoetry/virtualenvs/rl-suite-l5I817SR-py3.9/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/default/rayclusters/raycluster-kuberay (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x75886ad7c580>: Failed to establish a new connection: [Errno -2] Name or service not known'))
Further testing via kubectl get svc kubernetes -o wide
returns
Attempting to reach the service via curl -k https://kubernetes.default:443
does not work: curl could not resolve host kubernetes.default. However, reaching the service by IP via curl -k 10.96.0.1:443
does work, in that the client is able to send an HTTP request to the HTTPS server. So the API server is reachable from the pod, and only name resolution of kubernetes.default fails.
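To narrow this down from inside the head pod, here is a minimal sketch of the same name lookup that fails in the traceback (socket.getaddrinfo), tried against a few spellings of the API server name. The helper name can_resolve is made up for illustration, and the fully qualified name assumes the default cluster domain cluster.local, which may not match this cluster.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if `host` resolves to at least one address.

    This mirrors the lookup urllib3 performs before connecting; a
    socket.gaierror here corresponds to "Name or service not known".
    """
    try:
        results = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        return len(results) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Candidate names for the Kubernetes API service. The last name assumes
    # the default cluster domain "cluster.local" (an assumption, not taken
    # from the logs). The IP is the one that curl could reach.
    candidates = (
        "kubernetes.default",
        "kubernetes.default.svc",
        "kubernetes.default.svc.cluster.local",
        "10.96.0.1",
    )
    for name in candidates:
        status = "resolves" if can_resolve(name) else "does NOT resolve"
        print(f"{name}: {status}")
```

If only the short names fail while the fully qualified one resolves, that would point at the DNS search domains in the pod's /etc/resolv.conf rather than at the autoscaler itself.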
Do you think the autoscaler / Kubernetes host error could be the issue? I searched the Slack channel for similar previous issues, but the ones I found relate to an older version of Ray (1.4), whereas I am using Ray 2.4.