I have a big cluster on GCP with 50+ nodes that runs for more than 6 hours, and every once in a while Ray throws the following error:
Healthy:
1 ray_head
50 ray_worker_48
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
2353.0/2353.0 CPU
0.00/19863.624 GiB memory
0.00/8514.660 GiB object_store_memory
Demands:
{'CPU': 1.0}: 6155+ pending tasks/actors
2021-05-17 16:12:54,096 ERROR monitor.py:253 -- Error in monitor loop
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/ray/_private/monitor.py", line 284, in run
self._run()
File "/usr/local/lib/python3.8/site-packages/ray/_private/monitor.py", line 189, in _run
"autoscaler_report"] = self.autoscaler.summary()._asdict()
File "/usr/local/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 751, in summary
all_node_ids = self.provider.non_terminated_nodes(tag_filters={})
File "/usr/local/lib/python3.8/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 83, in non_terminated_nodes
response = self.compute.instances().list(
File "/usr/local/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 920, in execute
resp, content = _retry_request(
File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 222, in _retry_request
raise exception
File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 191, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/google_auth_httplib2.py", line 209, in request
self.credentials.before_request(self._request, method, uri, request_headers)
File "/usr/local/lib/python3.8/site-packages/google/auth/credentials.py", line 133, in before_request
self.refresh(request)
File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 111, in refresh
self._retrieve_info(request)
File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 87, in _retrieve_info
info = _metadata.get_service_account_info(
File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/_metadata.py", line 234, in get_service_account_info
return get(request, path, params={"recursive": "true"})
File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/_metadata.py", line 150, in get
response = request(url=url, method="GET", headers=_METADATA_HEADERS)
File "/usr/local/lib/python3.8/site-packages/google_auth_httplib2.py", line 119, in __call__
response, data = self.http.request(
File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1708, in request
(response, content) = self._request(
File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1424, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1347, in _conn_request
conn.request(method, request_uri, body, headers)
File "/usr/local/lib/python3.8/http/client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.8/http/client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.8/http/client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.8/http/client.py", line 1007, in _send_output
self.send(msg)
File "/usr/local/lib/python3.8/http/client.py", line 968, in send
self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
When this happens, the monitor log stops logging, and I'm assuming the autoscaler crashes as well; this doesn't seem to happen on a smaller cluster (far fewer nodes). Since I want my workload to finish in hours instead of days, I need the CPU power, hence the big clusters.
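As a possible stopgap, I'm considering retrying around the transient failure rather than letting the monitor die. This is an untested sketch (the function name and parameters are my own, not part of Ray or the Google client) of a generic backoff wrapper that could be put around the failing `instances().list(...)` call:

```python
import random
import time


def retry_on_broken_pipe(fn, attempts=5, base_delay=1.0):
    """Call fn(), retrying with jittered exponential backoff when the
    underlying HTTP connection surfaces BrokenPipeError (Errno 32).

    Re-raises the exception once the final attempt also fails.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except BrokenPipeError:
            if attempt == attempts - 1:
                raise
            # Back off before re-issuing the request; the new call should
            # open a fresh connection instead of reusing the broken socket.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# Hypothetical usage, wrapping the call that fails in the traceback:
# response = retry_on_broken_pipe(
#     lambda: compute.instances().list(project=project, zone=zone).execute()
# )
```

I don't know whether the autoscaler would tolerate being patched this way, which is why I'm asking below.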
Does the autoscaler have a GCP node limit up to which it can work reliably?