I have a big cluster on GCP with 50+ nodes that runs for more than 6 hours, and every once in a while Ray throws the following error:
Healthy:
1 ray_head
50 ray_worker_48
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
2353.0/2353.0 CPU
0.00/19863.624 GiB memory
0.00/8514.660 GiB object_store_memory
Demands:
{'CPU': 1.0}: 6155+ pending tasks/actors
2021-05-17 16:12:54,096 ERROR monitor.py:253 -- Error in monitor loop
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/ray/_private/monitor.py", line 284, in run
self._run()
File "/usr/local/lib/python3.8/site-packages/ray/_private/monitor.py", line 189, in _run
"autoscaler_report"] = self.autoscaler.summary()._asdict()
File "/usr/local/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 751, in summary
all_node_ids = self.provider.non_terminated_nodes(tag_filters={})
File "/usr/local/lib/python3.8/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 83, in non_terminated_nodes
response = self.compute.instances().list(
File "/usr/local/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 920, in execute
resp, content = _retry_request(
File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 222, in _retry_request
raise exception
File "/usr/local/lib/python3.8/site-packages/googleapiclient/http.py", line 191, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/google_auth_httplib2.py", line 209, in request
self.credentials.before_request(self._request, method, uri, request_headers)
File "/usr/local/lib/python3.8/site-packages/google/auth/credentials.py", line 133, in before_request
self.refresh(request)
File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 111, in refresh
self._retrieve_info(request)
File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 87, in _retrieve_info
info = _metadata.get_service_account_info(
File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/_metadata.py", line 234, in get_service_account_info
return get(request, path, params={"recursive": "true"})
File "/usr/local/lib/python3.8/site-packages/google/auth/compute_engine/_metadata.py", line 150, in get
response = request(url=url, method="GET", headers=_METADATA_HEADERS)
File "/usr/local/lib/python3.8/site-packages/google_auth_httplib2.py", line 119, in __call__
response, data = self.http.request(
File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1708, in request
(response, content) = self._request(
File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1424, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/usr/local/lib/python3.8/site-packages/httplib2/__init__.py", line 1347, in _conn_request
conn.request(method, request_uri, body, headers)
File "/usr/local/lib/python3.8/http/client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.8/http/client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.8/http/client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.8/http/client.py", line 1007, in _send_output
self.send(msg)
File "/usr/local/lib/python3.8/http/client.py", line 968, in send
self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
When this happens, the monitor log stops logging, and I'm assuming the autoscaler crashes as well; this doesn't seem to happen on a smaller cluster (far fewer nodes). Since I want my workload to finish in hours instead of days, I need the CPU power, hence the big clusters.
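As a possible stopgap, I'm considering retrying around the transient failure rather than letting the monitor die. This is an untested sketch (the function name and parameters are my own, not part of Ray or the Google client) of a generic backoff wrapper that could be put around the failing `instances().list(...)` call:

```python
import random
import time


def retry_on_broken_pipe(fn, attempts=5, base_delay=1.0):
    """Call fn(), retrying with jittered exponential backoff when the
    underlying HTTP connection surfaces BrokenPipeError (Errno 32).

    Re-raises the exception once the final attempt also fails.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except BrokenPipeError:
            if attempt == attempts - 1:
                raise
            # Back off before re-issuing the request; the new call should
            # open a fresh connection instead of reusing the broken socket.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# Hypothetical usage, wrapping the call that fails in the traceback:
# response = retry_on_broken_pipe(
#     lambda: compute.instances().list(project=project, zone=zone).execute()
# )
```

I don't know whether the autoscaler would tolerate being patched this way, which is why I'm asking below.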
Does the autoscaler have a GCP node limit up to which it can work reliably?