Testing autoscaler

Hi, I set up an autoscaling Ray cluster on Kubernetes with three pods: 1 head and 2 workers.
I used the default YAML from the master branch.

I tried to see whether the autoscaler works with the following script:

bundle={"CPU":4}
pg = placement_group([bundle],strategy="STRICT_PACK")
ray.get(pg.ready())

The program got stuck there, and checking the pod status showed that no new pod was created.

Here is the output of ray.cluster_resources():

{'CPU': 3.0, 'bar': 2.0, 'object_store_memory': 924901784.0, 'memory': 1127428914.0, 'node:10.23.129.66': 1.0, 'foo': 2.0, 'node:10.23.129.2': 1.0, 'node:10.23.128.130': 1.0}

@valiantljk what is your k8s YAML?

cc @Dmitri

After a few minutes, the program aborted with the following error:

2021-03-16 13:55:05,471	ERROR worker.py:936 -- print_logs: Connection closed by server.
2021-03-16 13:55:05,472	ERROR import_thread.py:88 -- ImportThread: Connection closed by server.
Aborted

Then I logged back into the head node and ran ray status, which showed another error:

 ray status
======== Autoscaler status: 2021-03-16 14:16:35.705243 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-node
 2 worker-node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/3.0 CPU
 0.0/2.0 bar
 0.0/2.0 foo
 0.00/1.904 GiB memory
 0.00/0.857 GiB object_store_memory

Demands:
 (no resource demands)
The autoscaler failed with the following error:
Terminated with signal 15
  File "/home/ray/anaconda3/bin/ray-operator", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 154, in main
    handle_event(event_type, cluster_cr, cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 113, in handle_event
    cluster_action(event_type, cluster_cr, cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 127, in cluster_action
    ray_clusters[cluster_name].create_or_update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 50, in create_or_update
    self.do_in_subprocess(self._create_or_update)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 40, in do_in_subprocess
    self.subprocess.start()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 74, in _launch
    code = process_obj._bootstrap()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 54, in _create_or_update
    self.start_monitor()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 79, in start_monitor
    self.mtr.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/monitor.py", line 274, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/monitor.py", line 177, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 135, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 186, in _update
    if (self._keep_min_worker_of_node_type(node_id, node_type_counts)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 414, in _keep_min_worker_of_node_type
    tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 64, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 22894, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 243, in GET
    query_params=query_params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 216, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 76, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 97, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 336, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 1344, in getresponse
    response.begin()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/ray/anaconda3/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/home/ray/anaconda3/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/home/ray/anaconda3/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)

I did not change the YAML from the master branch:

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  # The maximum number of workers nodes to launch in addition to the head node.
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes, then the autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: head-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 0
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-head-
      spec:
        restartPolicy: Never

        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which will cause slowdowns if it is not a shared memory volume.
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray:nightly
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379  # Redis port
          - containerPort: 10001  # Used by Ray Client
          - containerPort: 8265  # Used by Ray Dashboard

          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which will cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 512Mi
  - name: worker-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 2
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 3
    # User-specified custom resources for use by Ray.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    rayResources: {"foo": 1, "bar": 1}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: example-cluster-ray-worker-
      spec:
        restartPolicy: Never
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray:nightly
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which will cause slowdowns if it is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 512Mi
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

The placement group request will not succeed because no pod type in this config provides 4 CPUs.
("worker-node" and "head-node" pods each provide only 1 CPU, so a STRICT_PACK bundle requiring 4 CPUs won't fit on any single node.)
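For illustration, a request whose bundles each fit within a single pod's 1 CPU could be satisfied. A minimal sketch, assuming you run it on the head node of this same cluster and that your Ray version's autoscaler accounts for placement group demands:

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Four 1-CPU bundles: each bundle fits on a single 1-CPU pod, so the demand
# can be met by adding one more worker pod instead of hanging forever.
pg = placement_group([{"CPU": 1}] * 4, strategy="SPREAD")
ray.get(pg.ready())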

This failure mode is spectacularly bad. Could you please post an issue on the Ray GitHub so that we can track this?

Could you suggest a good way to test the autoscaler? I'd like to see how the pods get scaled.

Running the following code on the Ray head node should result in one additional pod launching on top of the three initial pods:

import time
import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def f():
    time.sleep(60)

ray.get([f.remote() for _ in range(4)])

The global maxWorkers field and the worker-node-specific maxWorkers field are both set to 3 in the above config, which leaves room for the autoscaler to spin up one additional 1-CPU worker pod so all 4 tasks can run at once.
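If you just want to watch pods scale up and down without writing tasks, you can also place an explicit resource demand through the autoscaler SDK. A minimal sketch, assuming a Ray version where ray.autoscaler.sdk.request_resources is available:

import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Create a standing demand for 4 CPUs; the autoscaler should launch worker
# pods until the cluster can provide them (subject to maxWorkers).
request_resources(num_cpus=4)

# Clear the demand later so idle workers can be removed after
# idleTimeoutMinutes.
request_resources(num_cpus=0)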


num_cpus=1 above is optional; ray.remote assumes a requirement of one CPU by default.
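That is, this sketch should schedule exactly like the version above with an explicit num_cpus=1:

import time
import ray

ray.init(address="auto")

@ray.remote  # no num_cpus given: Ray assumes 1 CPU per task by default
def f():
    time.sleep(60)

ray.get([f.remote() for _ in range(4)])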


Makes sense. I just tried that and still got errors:

2021-03-16 15:05:18,285	INFO worker.py:664 -- Connecting to existing Ray cluster at address: 10.23.128.130:6379
2021-03-16 15:05:18,354	WARNING worker.py:1066 -- The autoscaler failed with the following error:
Terminated with signal 15
  File "/home/ray/anaconda3/bin/ray-operator", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 154, in main
    handle_event(event_type, cluster_cr, cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 113, in handle_event
    cluster_action(event_type, cluster_cr, cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 127, in cluster_action
    ray_clusters[cluster_name].create_or_update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 50, in create_or_update
    self.do_in_subprocess(self._create_or_update)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 40, in do_in_subprocess
    self.subprocess.start()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 74, in _launch
    code = process_obj._bootstrap()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 54, in _create_or_update
    self.start_monitor()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 79, in start_monitor
    self.mtr.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/monitor.py", line 274, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/monitor.py", line 177, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 135, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 186, in _update
    if (self._keep_min_worker_of_node_type(node_id, node_type_counts)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 414, in _keep_min_worker_of_node_type
    tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 64, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 22894, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 243, in GET
    query_params=query_params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 216, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 76, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 97, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 336, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 1344, in getresponse
    response.begin()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/home/ray/anaconda3/lib/python3.7/http/client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/ray/anaconda3/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/home/ray/anaconda3/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/home/ray/anaconda3/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)

And if I try the following:

import time
import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def f():
    time.sleep(60)

ray.get([f.remote() for _ in range(2)])

it returns the same error.

oof, a Kubernetes API error…

Hmm… could you tear everything down (the operator and the Ray cluster), relaunch, and test the Ray script again?

It works! Thanks @Dmitri

I think the issue was GKE's Autopilot mode. I relaunched a Standard mode Kubernetes cluster, and it seems to be working now.

Cool — glad it works!