Does the Ray autoscaler have a maximum number of nodes it can handle?

I have a big cluster on GCP with 50+ nodes that runs for more than 6 hours, and every once in a while Ray throws the following error:

Healthy:
 1 ray_head
 50 ray_worker_48
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 2353.0/2353.0 CPU
 0.00/19863.624 GiB memory
 0.00/8514.660 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 6155+ pending tasks/actors
2021-05-17 16:12:54,096	ERROR monitor.py:253 -- Error in monitor loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ray/_private/monitor.py", line 284, in run
    self._run()
  File "/usr/local/lib/python3.8/site-packages/ray/_private/monitor.py", line 189, in _run
    "autoscaler_report"] = self.autoscaler.summary()._asdict()
  File "/usr/local/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 751, in summary
    all_node_ids = self.provider.non_terminated_nodes(tag_filters={})
  File "/usr/local/lib/python3.8/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 83, in non_terminated_nodes
    response = self.compute.instances().list(
  File "/usr/local/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 920, in execute
    resp, content = _retry_request(
  File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 222, in _retry_request
    raise exception
  File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 191, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/google_auth_httplib2.py", line 209, in request
    self.credentials.before_request(self._request, method, uri, request_headers)
  File "/usr/local/lib/python3.8/site-packages/google/auth/credentials.py", line 133, in before_request
    self.refresh(request)
  File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 111, in refresh
    self._retrieve_info(request)
  File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 87, in _retrieve_info
    info = _metadata.get_service_account_info(
  File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/_metadata.py", line 234, in get_service_account_info
    return get(request, path, params={"recursive": "true"})
  File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/_metadata.py", line 150, in get
    response = request(url=url, method="GET", headers=_METADATA_HEADERS)
  File "/usr/local/lib/python3.8/site-packages/google_auth_httplib2.py", line 119, in __call__
    response, data = self.http.request(
  File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1708, in request
    (response, content) = self._request(
  File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1424, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1347, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/usr/local/lib/python3.8/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1007, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.8/http/client.py", line 968, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe

When this happens, the monitor log stops updating, and I'm assuming the autoscaler crashes as well. This doesn't seem to happen with a smaller cluster (far fewer nodes). Since I want my workload to finish in hours instead of days, I need the CPU power, hence the big cluster.
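
For reference, the traceback ends in a `BrokenPipeError` raised from `sock.sendall` while the GCP node provider was listing instances, which points to a dropped connection. One generic way to tolerate that kind of transient error is to retry the call with exponential backoff. The following is only a minimal sketch of that idea: `call_with_retry` is a hypothetical helper, not part of Ray or the Google API client, and whether Ray's monitor loop can be wrapped this way is an assumption.

```python
import time


def call_with_retry(fn, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a transient connection error, back off and try again.

    Hypothetical helper for illustration only (not a Ray API).
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except (BrokenPipeError, ConnectionResetError):
            if attempt == retries:
                raise  # out of attempts: surface the original error
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

It would be used as `call_with_retry(list_nodes)`, where `list_nodes` stands in for whatever function makes the network call. The `sleep` parameter is injectable only so the backoff can be tested without waiting.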

Is there a limit on the number of GCP nodes the autoscaler can reliably handle?

Bumping this. What are the limits of Ray autoscaling, please?

Hello,

If this helps, here is the link to Ray's benchmark, which says 250+ nodes: