Autoscaler container restarts with requests.exceptions.ConnectionError

I have an autoscaling RayCluster running on AKS, but the autoscaler container keeps getting restarted after some time with the following error:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1/namespaces/ray-cluster/rayclusters/my-test-raycluster (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f8f15df54e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2335, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 86, in run_kuberay_autoscaler
    ).run()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 584, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 389, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 384, in update
    raise e
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 377, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 400, in _update
    self.non_terminated_nodes = NonTerminatedNodes(self.provider)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 124, in __init__
    self.all_node_ids = provider.non_terminated_nodes({})
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/batching_node_provider.py", line 155, in non_terminated_nodes
    self.node_data_dict = self.get_node_data()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 317, in get_node_data
    self._raycluster = self._get(f"rayclusters/{self.cluster_name}")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 503, in _get
    return self.k8s_api_client.get(path)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 255, in get
    result = requests.get(url, headers=self._headers, verify=self._verify)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1/namespaces/ray-cluster/rayclusters/my-test-raycluster (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f8f15df54e0>: Failed to establish a new connection: [Errno 111] Connection refused'))

Although there is no visible impact on the scaling of pods and nodes, I am afraid this error might affect the autoscaling process at some point.
Which settings can I add or change in my Kubernetes cluster or my RayCluster to avoid this error?

Kindly respond. Thanks

Judging from the URL and the hostname, it does look like the autoscaler is unable to connect to the Kubernetes API server. That could indicate an issue with cluster networking or DNS, although "Connection refused" (Errno 111) usually means the hostname resolved and the TCP connection itself was rejected. You said that this doesn't affect scaling of the pods and nodes, which I understand to mean that Ray worker nodes are created and removed correctly despite this error and the occasional restart of the Ray autoscaler.
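To narrow down whether this is a name-resolution problem or a rejected connection, a small check like the one below may help. It is only a rough sketch: the function name `classify_connectivity` and its return labels are made up for illustration, and it assumes it runs in a pod on the same cluster (ideally the same node) so that `kubernetes.default` resolves the same way it does for the autoscaler.

```python
import socket


def classify_connectivity(host, port, resolve=socket.getaddrinfo,
                          connect=socket.create_connection, timeout=3.0):
    """Return (kind, detail) describing how far a TCP connection attempt gets.

    kind is one of: "dns_failure", "refused", "unreachable", "ok".
    The resolve/connect parameters exist only so the function can be
    exercised without a real network (hypothetical helper, not a Ray API).
    """
    try:
        # Step 1: DNS lookup only. A failure here points at cluster DNS.
        resolve(host, port)
    except socket.gaierror as exc:
        return ("dns_failure", str(exc))

    try:
        # Step 2: plain TCP connect, no TLS and no HTTP request.
        conn = connect((host, port), timeout=timeout)
    except ConnectionRefusedError as exc:
        # Same condition as "[Errno 111] Connection refused" in the traceback.
        return ("refused", str(exc))
    except OSError as exc:
        # Timeouts, "no route to host", and similar failures.
        return ("unreachable", str(exc))

    conn.close()
    return ("ok", f"connected to {host}:{port}")
```

Calling `classify_connectivity("kubernetes.default", 443)` from inside the cluster would then show which stage fails. A "refused" result while DNS works would point away from DNS and toward whatever sits between the pod and the API server.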

Here are a few diagnostic steps to try:

  • Check for any errors in the Kubernetes API server logs.
  • Check whether you have more than one Ray autoscaler (or Ray cluster) running; the error may be coming not from the working cluster but from its twin.
  • Run a pod on the same node as the Ray autoscaler that calls kubectl -n ray-cluster describe raycluster/my-test-raycluster in a loop (you should be able to give it the same service account that Ray uses).
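If kubectl is not available in the image, the last step could also be approximated in Python. This is a minimal sketch, not the autoscaler's own code: the names `fetch_status` and `probe` are hypothetical, the URL is copied from the error message above, and it assumes the pod has the default service account token and CA certificate mounted under /var/run/secrets/kubernetes.io/serviceaccount.

```python
import ssl
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

# Assumption: default in-cluster service account mount path.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
# Taken from the error message in the question.
URL = ("https://kubernetes.default:443/apis/ray.io/v1/"
       "namespaces/ray-cluster/rayclusters/my-test-raycluster")


def fetch_status(url=URL, timeout=5.0):
    """GET the RayCluster object once and return the HTTP status code."""
    with open(f"{SA_DIR}/token") as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=f"{SA_DIR}/ca.crt")
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The API server answered (e.g. 403 or 404), so the connection worked.
        return exc.code


def probe(fetch=fetch_status, attempts=10, interval=1.0, sleep=time.sleep):
    """Call fetch() repeatedly; return a list of (attempt, error) failures."""
    failures = []
    for attempt in range(attempts):
        stamp = datetime.now(timezone.utc).isoformat()
        try:
            status = fetch()
            print(f"{stamp} attempt={attempt} status={status}")
        except (urllib.error.URLError, ConnectionError, TimeoutError) as exc:
            # Connection-level failure, like the one in the traceback.
            failures.append((attempt, repr(exc)))
            print(f"{stamp} attempt={attempt} FAILED {exc!r}")
        if attempt < attempts - 1:
            sleep(interval)
    return failures
```

Running something like `probe(attempts=3600, interval=1.0)` for an hour and comparing the failure timestamps with the autoscaler restarts and the API server logs should show whether the refusals are specific to the autoscaler container or affect any client on that node.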