I have an autoscaling RayCluster running on AKS, but the autoscaler keeps getting restarted after some time with the following error:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1/namespaces/ray-cluster/rayclusters/my-test-raycluster (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f8f15df54e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2335, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 86, in run_kuberay_autoscaler
    ).run()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 584, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 389, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 384, in update
    raise e
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 377, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 400, in _update
    self.non_terminated_nodes = NonTerminatedNodes(self.provider)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 124, in __init__
    self.all_node_ids = provider.non_terminated_nodes({})
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/batching_node_provider.py", line 155, in non_terminated_nodes
    self.node_data_dict = self.get_node_data()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 317, in get_node_data
    self._raycluster = self._get(f"rayclusters/{self.cluster_name}")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 503, in _get
    return self.k8s_api_client.get(path)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 255, in get
    result = requests.get(url, headers=self._headers, verify=self._verify)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1/namespaces/ray-cluster/rayclusters/my-test-raycluster (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f8f15df54e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
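In case it helps with reproducing the issue, here is a minimal sketch of the same GET request shown in the traceback, which could be run from inside the head pod to check whether the API server is intermittently refusing connections. It assumes the standard in-cluster service account token and CA paths are mounted. The helper names (`raycluster_url`, `fetch_raycluster`, `retry`) are hypothetical and are not part of Ray or KubeRay, and the retry with backoff is only there for probing; it is not how the autoscaler itself behaves.

```python
import ssl
import time
import urllib.error
import urllib.request

# Assumption: default Kubernetes service account mounts inside the pod.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"


def raycluster_url(namespace, name, host="kubernetes.default", port=443):
    """Build the RayCluster URL that appears in the error message."""
    return (
        f"https://{host}:{port}/apis/ray.io/v1/"
        f"namespaces/{namespace}/rayclusters/{name}"
    )


def fetch_raycluster(url):
    """Issue one authenticated GET against the Kubernetes API server."""
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=CA_PATH)
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return resp.status


def retry(fetch, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially on connection errors."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (urllib.error.URLError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    # Namespace and cluster name are taken from the error message.
    url = raycluster_url("ray-cluster", "my-test-raycluster")
    print(retry(lambda: fetch_raycluster(url)))
```

If this probe also sees occasional "Connection refused" errors that succeed on retry, the problem is more likely transient API server unavailability than something specific to the RayCluster configuration.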
So far there is no visible impact on the scaling of pods and nodes, but I am afraid this error might affect the autoscaling process at some point.
What configuration can I add or modify in my Kubernetes cluster or my RayCluster to avoid this error?
Kindly respond. Thanks.