Testing autoscaler

Hi, I set up an autoscaling Ray cluster on Kubernetes with three pods: 1 head and 2 workers.
I used the default YAML from the master branch.

I tried to see whether the autoscaler works with the following script:

bundle={"CPU":4}
pg = placement_group([bundle],strategy="STRICT_PACK")
ray.get(pg.ready())

The program got stuck there, and checking the pod status showed that no new pod was created.

Here is the output of ray.cluster_resources():

{'CPU': 3.0, 'bar': 2.0, 'object_store_memory': 924901784.0, 'memory': 1127428914.0, 'node:10.23.129.66': 1.0, 'foo': 2.0, 'node:10.23.129.2': 1.0, 'node:10.23.128.130': 1.0}

@valiantljk what is your k8s YAML?

cc @Dmitri

After a few minutes, the program aborted with the following error:

2021-03-16 13:55:05,471	ERROR worker.py:936 -- print_logs: Connection closed by server.
2021-03-16 13:55:05,472	ERROR import_thread.py:88 -- ImportThread: Connection closed by server.
Aborted

Then I logged back into the head node and ran ray status, which showed another error:

 ray status
======== Autoscaler status: 2021-03-16 14:16:35.705243 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-node
 2 worker-node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/3.0 CPU
 0.0/2.0 bar
 0.0/2.0 foo
 0.00/1.904 GiB memory
 0.00/0.857 GiB object_store_memory

Demands:
 (no resource demands)
The autoscaler failed with the following error:
Terminated with signal 15
  File "/home/ray/anaconda3/bin/ray-operator", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 154, in main
    handle_event(event_type, cluster_cr, cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 113, in handle_event
    cluster_action(event_type, cluster_cr, cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 127, in cluster_action
    ray_clusters[cluster_name].create_or_update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 50, in create_or_update
    self.do_in_subprocess(self._create_or_update)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 40, in do_in_subprocess
    self.subprocess.start()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 74, in _launch
    code = process_obj._bootstrap()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 54, in _create_or_update
    self.start_monitor()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 79, in start_monitor
    self.mtr.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/monitor.py", line 274, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/monitor.py", line 177, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 135, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 186, in _update
    if (self._keep_min_worker_of_node_type(node_id, node_type_counts)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 414, in _keep_min_worker_of_node_type
    tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 64, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 22894, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 243, in GET
    query_params=query_params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 216, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 76, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 97, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 336, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 1344, in getresponse
    response.begin()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/ray/anaconda3/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/home/ray/anaconda3/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/home/ray/anaconda3/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)

I did not change the YAML from the master branch:

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  # The maximum number of workers nodes to launch in addition to the head node.
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes, then the autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: head-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 0
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-head-
      spec:
        restartPolicy: Never

        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which will cause slowdowns if it is not a shared memory volume.
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray:nightly
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379  # Redis port
          - containerPort: 10001  # Used by Ray Client
          - containerPort: 8265  # Used by Ray Dashboard

          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which will cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 512Mi
  - name: worker-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 2
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 3
    # User-specified custom resources for use by Ray.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    rayResources: {"foo": 1, "bar": 1}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-worker-
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray:nightly
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which will cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 512Mi
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

The placement group request will not succeed because no pod type in this config provides 4 CPUs.
("worker-node" and "head-node" pods each provide only 1 CPU, so a STRICT_PACK bundle requiring 4 CPUs won't fit on any single node.)
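For illustration, a request whose bundles each fit within a single pod's 1 CPU could be satisfied. A minimal sketch, assuming you run it on the head node of this same cluster and that your Ray version's autoscaler accounts for placement group demands:

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Four 1-CPU bundles: each bundle fits on a single 1-CPU pod, so the demand
# can be met by adding one more worker pod instead of hanging forever.
pg = placement_group([{"CPU": 1}] * 4, strategy="SPREAD")
ray.get(pg.ready())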

This failure mode is spectacularly bad. Could you please post an issue on the Ray GitHub so that we can track this?

Could you suggest a good way to test the autoscaler? I'd like to see how the pods get scaled.

Running the following code on the Ray head node should result in one additional pod launching on top of the three initial pods:

import time
import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def f():
    time.sleep(60)

ray.get([f.remote() for _ in range(4)])

The global maxWorkers field and the worker-node-specific maxWorkers field are both set to 3 in the above config, which leaves room for the autoscaler to spin up one additional 1-CPU worker pod so all 4 tasks can run at once.
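If you just want to watch pods scale up and down without writing tasks, you can also place an explicit resource demand through the autoscaler SDK. A minimal sketch, assuming a Ray version where ray.autoscaler.sdk.request_resources is available:

import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Create a standing demand for 4 CPUs; the autoscaler should launch worker
# pods until the cluster can provide them (subject to maxWorkers).
request_resources(num_cpus=4)

# Clear the demand later so idle workers can be removed after
# idleTimeoutMinutes.
request_resources(num_cpus=0)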


num_cpus=1 above is optional; ray.remote assumes a requirement of one CPU by default.
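That is, this sketch should schedule exactly like the version above with an explicit num_cpus=1:

import time
import ray

ray.init(address="auto")

@ray.remote  # no num_cpus given: Ray assumes 1 CPU per task by default
def f():
    time.sleep(60)

ray.get([f.remote() for _ in range(4)])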


Makes sense. I just tried that and still got errors:

2021-03-16 15:05:18,285	INFO worker.py:664 -- Connecting to existing Ray cluster at address: 10.23.128.130:6379
2021-03-16 15:05:18,354	WARNING worker.py:1066 -- The autoscaler failed with the following error:
Terminated with signal 15
  File "/home/ray/anaconda3/bin/ray-operator", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 154, in main
    handle_event(event_type, cluster_cr, cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 113, in handle_event
    cluster_action(event_type, cluster_cr, cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 127, in cluster_action
    ray_clusters[cluster_name].create_or_update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 50, in create_or_update
    self.do_in_subprocess(self._create_or_update)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 40, in do_in_subprocess
    self.subprocess.start()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 74, in _launch
    code = process_obj._bootstrap()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 54, in _create_or_update
    self.start_monitor()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 79, in start_monitor
    self.mtr.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/monitor.py", line 274, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/monitor.py", line 177, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 135, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 186, in _update
    if (self._keep_min_worker_of_node_type(node_id, node_type_counts)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 414, in _keep_min_worker_of_node_type
    tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 64, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 22894, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 243, in GET
    query_params=query_params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 216, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 76, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 97, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 336, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 1344, in getresponse
    response.begin()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/ray/anaconda3/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/home/ray/anaconda3/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/home/ray/anaconda3/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)

And if I try the following:

import time
import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def f():
    time.sleep(60)

ray.get([f.remote() for _ in range(2)])

it returns the same error.

oof, a Kubernetes API error…

Hmm… could you tear everything down (the operator and the Ray cluster), relaunch, and test the Ray script again?

It works! Thanks @Dmitri

I think the issue was GKE's Autopilot mode. I relaunched a Standard mode Kubernetes cluster, and it seems to be working now.

Cool — glad it works!