It has failed again. Here is the last part of monitor.log:
2021-03-19 05:38:07,253 INFO monitor.py:182 -- :event_summary:Resized to 0 CPUs.
2021-03-19 05:38:12,386 ERROR autoscaler.py:270 -- StandardAutoscaler: ray-worker-cpu-m5z86: Terminating failed to setup/initialize node.
2021-03-19 05:38:12,393 ERROR autoscaler.py:142 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 140, in update
self._update()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 274, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 601, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 65, in node_tags
pod = core_api().read_namespaced_pod(node_id, self.namespace)
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs) # noqa: E501
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22880, in read_namespaced_pod_with_http_info
return self.api_client.call_api(
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
return self.rest_client.GET(url,
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 239, in GET
return self.request("GET", url,
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '0d2cb42d-1c2a-4981-9006-07449ddc528a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 19 Mar 2021 12:38:12 GMT', 'Content-Length': '208'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-m5z86\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-m5z86","kind":"pods"},"code":404}
2021-03-19 05:38:12,393 CRITICAL autoscaler.py:152 -- StandardAutoscaler: Too many errors, abort.
2021-03-19 05:38:12,394 ERROR monitor.py:243 -- Error in monitor loop
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/monitor.py", line 274, in run
self._run()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/monitor.py", line 177, in _run
self.autoscaler.update()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 154, in update
raise e
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 140, in update
self._update()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 274, in _update
self._get_node_type(node_id) + " (launch failed).",
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 601, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 65, in node_tags
pod = core_api().read_namespaced_pod(node_id, self.namespace)
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs) # noqa: E501
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22880, in read_namespaced_pod_with_http_info
return self.api_client.call_api(
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
return self.rest_client.GET(url,
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 239, in GET
return self.request("GET", url,
File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '0d2cb42d-1c2a-4981-9006-07449ddc528a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 19 Mar 2021 12:38:12 GMT', 'Content-Length': '208'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-m5z86\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-m5z86","kind":"pods"},"code":404}
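
Reading the traceback, the autoscaler terminates ray-worker-cpu-m5z86 and then asks the provider for that pod's tags while building the "(launch failed)" message. The pod is already gone by then, so the 404 propagates out of update() and into the monitor loop. Below is a minimal sketch of tolerating a 404 on such a lookup. It is an illustration only, not Ray's actual code or fix: `read_or_none` and `NotFoundLikeError` are hypothetical names, and the only assumption about the real `kubernetes.client.exceptions.ApiException` is that it carries the HTTP code in its `status` attribute.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class NotFoundLikeError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status.

    It exists only so this sketch runs without the kubernetes package.
    """

    def __init__(self, status: int) -> None:
        super().__init__(f"({status})")
        self.status = status


def read_or_none(read: Callable[[], T]) -> Optional[T]:
    """Call `read` and return its result.

    Returns None if `read` raises an error whose `status` attribute is 404,
    i.e. the object no longer exists. Any other error is re-raised unchanged.
    """
    try:
        return read()
    except Exception as exc:
        if getattr(exc, "status", None) == 404:
            return None
        raise


# Hypothetical usage against the call shown in the traceback:
#   pod = read_or_none(
#       lambda: core_api().read_namespaced_pod(node_id, namespace))
#   if pod is None:
#       ...  # the pod was already deleted, so there are no tags to read
```

Whether skipping the lookup is the right behaviour for the autoscaler is a separate question; the sketch only shows that a missing pod can be told apart from other API failures.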