Stuck trying to take down workers

The autoscaler's KubernetesNodeProvider calls to delete_namespaced_pod fail multiple times with:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-9r759\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-9r759","kind":"pods"},"code":404}
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-hfshc\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-hfshc","kind":"pods"},"code":404}
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-drnt8\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-drnt8","kind":"pods"},"code":404}
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-f6xdm\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-f6xdm","kind":"pods"},"code":404}
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-zxt7j\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-zxt7j","kind":"pods"},"code":404}
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-cd2kb\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-cd2kb","kind":"pods"},"code":404}
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-cd2kb\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-cd2kb","kind":"pods"},"code":404}

Ray tries to take down the workers but gets stuck:

2021-03-17 16:34:08,696 INFO monitor.py:284 -- Monitor: Exception caught. Taking down workers...
2021-03-17 16:34:08,776 WARNING config.py:70 -- KubernetesNodeProvider: not checking if namespace 'ray' exists
2021-03-17 16:34:08,778 ERROR monitor.py:298 -- Monitor: Cleanup exception. Trying again...
2021-03-17 16:34:10,853 WARNING config.py:70 -- KubernetesNodeProvider: not checking if namespace 'ray' exists
2021-03-17 16:34:10,856 ERROR monitor.py:298 -- Monitor: Cleanup exception. Trying again...
2021-03-17 16:34:12,939 WARNING config.py:70 -- KubernetesNodeProvider: not checking if namespace 'ray' exists
2021-03-17 16:34:12,942 ERROR monitor.py:298 -- Monitor: Cleanup exception. Trying again...
2021-03-17 16:34:15,015 WARNING config.py:70 -- KubernetesNodeProvider: not checking if namespace 'ray' exists

This seems to have happened a few times, but I’ve only looked at the logs of the last occurrence.
The cluster is running Ray 1.1.0.

Local Ray >= 1.2.0 together with the rayproject/ray:nightly image in the K8s cluster is likely to give better results.
If you need reproducibility, rayproject/ray:<first six digits of commit sha> can be used to pin the Ray commit.
Also, Ray 1.3.0 should be out within the next couple of weeks!

We are using custom images for the worker nodes. I updated them to use 00aceaae37e60b8e55d8997cc1a6777748a92558/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl, while the head node, which does not run any workers, is now set to use rayproject/ray:00acea-py38. Until now we have been using our custom image on both the worker nodes and the head node. Is it OK to use different images on the workers and the head node as long as they use the same Ray and Python versions? It at least worked fine when I ran some tests on it.
I will now deploy the new setup and see if it makes it more stable.

Yes, with matching Python and Ray versions it should be fine. The one caveat is a driver script running on the head node that depends on the extra content of the custom images, since the head image won't have it.
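
If you want a quick sanity check that the images really agree, something like the rough sketch below (assuming a driver can attach to the running cluster) should print the same Ray version, commit, and Python version for the driver and a worker:

import sys
import ray

ray.init(address="auto")

@ray.remote
def worker_versions():
    import sys
    import ray
    return ray.__version__, ray.__commit__, sys.version_info[:2]

# The driver runs wherever you launch it (e.g. the head pod); the task runs on a worker.
print("driver:", ray.__version__, ray.__commit__, sys.version_info[:2])
print("worker:", ray.get(worker_versions.remote()))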

Let me know how it goes.

It has failed again. Here is the last part of the monitor.log:


2021-03-19 05:38:07,253	INFO monitor.py:182 -- :event_summary:Resized to 0 CPUs.
2021-03-19 05:38:12,386	ERROR autoscaler.py:270 -- StandardAutoscaler: ray-worker-cpu-m5z86: Terminating failed to setup/initialize node.
2021-03-19 05:38:12,393	ERROR autoscaler.py:142 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 140, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 274, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 601, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 65, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22880, in read_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 239, in GET
    return self.request("GET", url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '0d2cb42d-1c2a-4981-9006-07449ddc528a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 19 Mar 2021 12:38:12 GMT', 'Content-Length': '208'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-m5z86\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-m5z86","kind":"pods"},"code":404}


2021-03-19 05:38:12,393	CRITICAL autoscaler.py:152 -- StandardAutoscaler: Too many errors, abort.
2021-03-19 05:38:12,394	ERROR monitor.py:243 -- Error in monitor loop
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/monitor.py", line 274, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/monitor.py", line 177, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 154, in update
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 140, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 274, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 601, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 65, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22880, in read_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 239, in GET
    return self.request("GET", url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '0d2cb42d-1c2a-4981-9006-07449ddc528a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 19 Mar 2021 12:38:12 GMT', 'Content-Length': '208'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-m5z86\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-m5z86","kind":"pods"},"code":404}

Hmm, that’s not good.

(1) Might seem a bit silly, but can you try rayproject/ray:cd89f0-py38 for the head node pod? That image contains this bug fix: https://github.com/ray-project/ray/pull/14773
I’m curious to see if it’s related.

(2) Could you provide more info on the config used to launch the Ray cluster?

(3) Could you provide some details on your K8s environment?

(1) I will test it out. Might not get any results until Monday.
(2) Here is the ray config:

# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 0

# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 50

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 10.0

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Kubernetes resources that need to be configured for the autoscaler to be
# able to manage the Ray cluster. If any of the provided resources don't
# exist, the autoscaler will attempt to create them. If this fails, you may
# not have the required permissions and will have to request them to be
# created by your cluster administrator.
provider:
    type: kubernetes

    # Exposing external IP addresses for ray pods isn't currently supported.
    use_internal_ips: true

    # Namespace to use for all resources created.
    namespace: ray

    # ServiceAccount created by the autoscaler for the head node pod that it
    # runs in. If this field isn't provided, the head pod config below must
    # contain a user-created service account with the proper permissions.
    autoscaler_service_account:
        apiVersion: v1
        kind: ServiceAccount
        metadata:
            name: autoscaler

    # Role created by the autoscaler for the head node pod that it runs in.
    # If this field isn't provided, the role referenced in
    # autoscaler_role_binding must exist and have at least these permissions.
    autoscaler_role:
        kind: Role
        apiVersion: rbac.authorization.k8s.io/v1
        metadata:
            name: autoscaler
        rules:
        - apiGroups: [""]
          resources: ["pods", "pods/status", "pods/exec"]
          verbs: ["get", "watch", "list", "create", "delete", "patch"]

    # RoleBinding created by the autoscaler for the head node pod that it runs
    # in. If this field isn't provided, the head pod config below must contain
    # a user-created service account with the proper permissions.
    autoscaler_role_binding:
        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        metadata:
            name: autoscaler
        subjects:
        - kind: ServiceAccount
          name: autoscaler
        roleRef:
            kind: Role
            name: autoscaler
            apiGroup: rbac.authorization.k8s.io

    services:
      # Service that maps to the head node of the Ray cluster.
      - apiVersion: v1
        kind: Service
        metadata:
            # NOTE: If you're running multiple Ray clusters with services
            # on one Kubernetes cluster, they must have unique service
            # names.
            name: ray-head
        spec:
            # This selector must match the head node pod's selector below.
            selector:
                component: ray-head
            ports:
                - protocol: TCP
                  port: 8000
                  targetPort: 8000
                  name: 8k
                - protocol: TCP
                  port: 6379
                  name: redis-primary
                  targetPort: 6379
                - protocol: TCP
                  port: 6380
                  targetPort: 6380
                  name: redis-shard-0
                - protocol: TCP
                  port: 6381
                  targetPort: 6381
                  name: redis-shard-1
                - protocol: TCP
                  port: 12345
                  targetPort: 12345
                  name: object-manager
                - protocol: TCP
                  port: 12346
                  targetPort: 12346
                  name: node-manager

      # Service that maps to the worker nodes of the Ray cluster.
      - apiVersion: v1
        kind: Service
        metadata:
            # NOTE: If you're running multiple Ray clusters with services
            # on one Kubernetes cluster, they must have unique service
            # names.
            name: ray-workers
        spec:
            # This selector must match the worker node pods' selector below.
            selector:
                component: ray-worker
            ports:
                - protocol: TCP
                  port: 8000
                  targetPort: 8000

available_node_types:
    head_node:
        node_config:
            metadata:
                # Automatically generates a name for the pod with this prefix.
                generateName: ray-head-

                # Must match the head node service selector above if a head node
                # service is required.
                labels:
                    component: ray-head
                tolerations:
                - key: ray-head
                  operator: Equal
                  value: "true"
                  effect: NoSchedule
        resources: {}
        min_workers: 0
        max_workers: 0
    node_pool_cpu:
        node_config:
            apiVersion: v1
            kind: Pod
            metadata:
                # Automatically generates a name for the pod with this prefix.
                generateName: ray-worker-cpu-

                # Must match the worker node service selector above if a worker node
                # service is required.
                labels:
                    component: ray-worker
            spec:
                tolerations:
                - key: cloud.google.com/gke-preemptible
                  operator: Equal
                  value: "true"
                  effect: NoSchedule
                - key: imerso-ray-worker
                  operator: Equal
                  value: "true"
                  effect: NoSchedule

                serviceAccountName: ray-prod

                # Worker nodes will be managed automatically by the head node, so
                # do not change the restart policy.
                restartPolicy: Never

                # This volume allocates shared memory for Ray to use for its plasma
                # object store. If you do not provide this, Ray will fall back to
                # /tmp, which can cause slowdowns since it is not a shared memory volume.
                volumes:
                - name: dshm
                  emptyDir:
                      medium: Memory
                - name: filestore-ray
                  persistentVolumeClaim:
                    claimName: fileserver-ray-claim
                    readOnly: false

                containers:
                - name: ray-node
                  imagePullPolicy: Always
                  # You are free (and encouraged) to use your own container image,
                  # but it should have the following installed:
                  #   - rsync (used for `ray rsync` commands and file mounts)
                  image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
                  # Do not change this command - it keeps the pod alive until it is
                  # explicitly killed.
                  command: ["/bin/bash", "-c", "--"]
                  args: ["trap : TERM INT; sudo /usr/sbin/ldconfig.real; sleep infinity & wait;"]
                  ports:
                      - containerPort: 12345 # Ray internal communication.
                      - containerPort: 12346 # Ray internal communication.

                  # This volume allocates shared memory for Ray to use for its plasma
                  # object store. If you do not provide this, Ray will fall back to
                  # /tmp, which can cause slowdowns since it is not a shared memory volume.
                  volumeMounts:
                      - mountPath: /dev/shm
                        name: dshm
                      - mountPath: /filestore
                        name: filestore-ray
                  resources:
                      requests:
                          cpu: 7
                          memory: 25Gi
                  env:
                      # This is used in the head_start_ray_commands below so that
                      # Ray can spawn the correct number of processes. Omitting this
                      # may lead to degraded performance.
                      - name: MY_CPU_REQUEST
                        valueFrom:
                            resourceFieldRef:
                                resource: requests.cpu
        resources: {"CPU": 7, "memory": 26843545600} # Memory-unit ~= 52 MB
        min_workers: 0
        max_workers: 50
    node_pool_gpu:
        node_config:
            apiVersion: v1
            kind: Pod
            metadata:
                # Automatically generates a name for the pod with this prefix.
                generateName: ray-worker-gpu-

                # Must match the worker node service selector above if a worker node
                # service is required.
                labels:
                    component: ray-worker
            spec:
                tolerations:
                - key: cloud.google.com/gke-preemptible
                  operator: Equal
                  value: "true"
                  effect: NoSchedule
                - key: imerso-ray-worker
                  operator: Equal
                  value: "true"
                  effect: NoSchedule

                serviceAccountName: ray-prod

                # Worker nodes will be managed automatically by the head node, so
                # do not change the restart policy.
                restartPolicy: Never

                # This volume allocates shared memory for Ray to use for its plasma
                # object store. If you do not provide this, Ray will fall back to
                # /tmp, which can cause slowdowns since it is not a shared memory volume.
                volumes:
                - name: dshm
                  emptyDir:
                      medium: Memory
                - name: filestore-ray
                  persistentVolumeClaim:
                    claimName: fileserver-ray-claim
                    readOnly: false

                containers:
                - name: ray-node
                  imagePullPolicy: Always
                  # You are free (and encouraged) to use your own container image,
                  # but it should have the following installed:
                  #   - rsync (used for `ray rsync` commands and file mounts)
                  image: eu.gcr.io/imerso-3dscanner-backend/imerso-ray:${VERSION_TAG}
                  # Do not change this command - it keeps the pod alive until it is
                  # explicitly killed.
                  command: ["/bin/bash", "-c", "--"]
                  args: ["trap : TERM INT; sudo /usr/sbin/ldconfig.real; sleep infinity & wait;"]
                  ports:
                      - containerPort: 12345 # Ray internal communication.
                      - containerPort: 12346 # Ray internal communication.

                  # This volume allocates shared memory for Ray to use for its plasma
                  # object store. If you do not provide this, Ray will fall back to
                  # /tmp, which can cause slowdowns since it is not a shared memory volume.
                  volumeMounts:
                      - mountPath: /dev/shm
                        name: dshm
                      - mountPath: /filestore
                        name: filestore-ray
                  resources:
                      requests:
                          cpu: 7
                          memory: 25Gi
                      limits:
                          nvidia.com/gpu: 1
                  env:
                      # This is used in the head_start_ray_commands below so that
                      # Ray can spawn the correct number of processes. Omitting this
                      # may lead to degraded performance.
                      - name: MY_CPU_REQUEST
                        valueFrom:
                            resourceFieldRef:
                                resource: requests.cpu
        resources: {"CPU": 7, "memory": 26843545600, "GPU": 1, "accelerator_type:T4": 1} # Memory-unit ~= 52 MB
        min_workers: 0
        max_workers: 20

head_node_type: head_node     
worker_default_node_type: node_pool_cpu


# Kubernetes pod config for the head node pod.
head_node:
    apiVersion: v1
    kind: Pod
    spec:
        # Change this if you altered the autoscaler_service_account above
        # or want to provide your own.
        serviceAccountName: autoscaler

        # Restarting the head node automatically is not currently supported.
        # If the head node goes down, `ray up` must be run again.
        restartPolicy: Never

        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which can cause slowdowns since it is not a shared memory volume.
        volumes:
        - name: dshm
          emptyDir:
              medium: Memory
        - name: filestore-ray
          persistentVolumeClaim:
            claimName: fileserver-ray-claim
            readOnly: false

        containers:
        - name: ray-node
          imagePullPolicy: Always
          # You are free (and encouraged) to use your own container image,
          # but it should have the following installed:
          #   - rsync (used for `ray rsync` commands and file mounts)
          #   - screen (used for `ray attach`)
          #   - kubectl (used by the autoscaler to manage worker pods)
          image: rayproject/ray:cd89f0-py38
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          ports:
              - containerPort: 6379 # Redis port.
              - containerPort: 6380 # Redis port.
              - containerPort: 6381 # Redis port.
              - containerPort: 12345 # Ray internal communication.
              - containerPort: 12346 # Ray internal communication.

          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which can cause slowdowns since it is not a shared memory volume.
          volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /filestore
                name: filestore-ray
          resources:
              requests:
                  cpu: 500m
                  memory: 5Gi
              limits:
                  memory: 5Gi
          env:
              # This is used in the head_start_ray_commands below so that
              # Ray can spawn the correct number of processes. Omitting this
              # may lead to degraded performance.
              - name: MY_CPU_REQUEST
                valueFrom:
                    resourceFieldRef:
                        resource: requests.cpu


# Kubernetes pod config for worker node pods.
worker_nodes: {}

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "~/path1/on/remote/machine": "/path1/on/local/machine",
#    "~/path2/on/remote/machine": "/path2/on/local/machine",
}
# Note that the container images in this example have a non-root user.
# To avoid permissions issues, we recommend mounting into a subdirectory of home (~).

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down.
# This is not supported on kubernetes.
# rsync_exclude: []

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
# This is not supported on kubernetes.
# rsync_filter: []

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
# Note webui-host is set to 0.0.0.0 so that kubernetes can port forward.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --num-cpus=0 --object-store-memory 1073741824 --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=ray-head.ray.svc.cluster.local:6379 --object-store-memory 104857600 --object-manager-port=8076


(3) Here are the node pools:


gcloud container node-pools create e2-standard-2 \
  --cluster=imerso-services \
  --machine-type=e2-standard-2 \
  --enable-autoscaling \
  --num-nodes=0 \
  --min-nodes 0 \
  --max-nodes 20 \
  --zone europe-west2-a

gcloud container node-pools create ray-worker \
  --cluster=imerso-services \
  --machine-type=n1-standard-8 \
  --preemptible \
  --node-taints cloud.google.com/gke-preemptible="true":NoSchedule,imerso-ray-worker="true":NoSchedule \
  --enable-autoscaling \
  --num-nodes=0 \
  --min-nodes 0 \
  --max-nodes 50 \
  --zone europe-west2-a \
  --node-labels=imerso-ray-worker=true

gcloud container node-pools create ray-gpu-worker \
  --cluster=imerso-services \
  --machine-type=n1-standard-8 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --preemptible \
  --node-taints cloud.google.com/gke-preemptible="true":NoSchedule,imerso-ray-worker="true":NoSchedule \
  --enable-autoscaling \
  --num-nodes=0 \
  --min-nodes 0 \
  --max-nodes 20 \
  --zone europe-west2-a \
  --node-labels=imerso-ray-gpu-worker=true

The workers run on preemptible nodes. Is there anything else about the K8s environment that would be relevant?

(1) I managed to get it to fail. The error seems pretty similar:

2021-03-19 12:14:08,467 ERROR autoscaler.py:270 -- StandardAutoscaler: ray-worker-cpu-lrwzl: Terminating failed to setup/initialize node.
2021-03-19 12:14:08,470 ERROR autoscaler.py:142 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 140, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 274, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 601, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 65, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22880, in read_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 239, in GET
    return self.request("GET", url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'a0de5fc8-1ebe-41f2-8d7a-1de70ec65555', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 19 Mar 2021 19:14:08 GMT', 'Content-Length': '208'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-lrwzl\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-lrwzl","kind":"pods"},"code":404}


2021-03-19 12:14:08,472 CRITICAL autoscaler.py:152 -- StandardAutoscaler: Too many errors, abort.
2021-03-19 12:14:08,473 ERROR monitor.py:243 -- Error in monitor loop
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/monitor.py", line 274, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/monitor.py", line 177, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 154, in update
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 140, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 274, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 601, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 65, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22880, in read_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 239, in GET
    return self.request("GET", url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'a0de5fc8-1ebe-41f2-8d7a-1de70ec65555', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 19 Mar 2021 19:14:08 GMT', 'Content-Length': '208'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-lrwzl\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-lrwzl","kind":"pods"},"code":404}

Thanks for the details!

The fact that the nodes are preemptible is potentially relevant, and we definitely want to make sure the Ray autoscaler handles K8s node preemption correctly.

One more question: at what point in your workflow does this error occur?
Is it after the Ray cluster has been running for a while and one of the K8s nodes is interrupted?
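
If it helps with reproducing on your side: a crude way to simulate a preemption is to delete a worker pod out from under the autoscaler. Here is a rough sketch with the Kubernetes Python client, assuming the 'ray' namespace and the component=ray-worker label from your config:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Pick one Ray worker pod and delete it, roughly mimicking the node being reclaimed.
pods = v1.list_namespaced_pod("ray", label_selector="component=ray-worker")
if pods.items:
    v1.delete_namespaced_pod(pods.items[0].metadata.name, "ray")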

I believe it happens when it’s scaling down. Is there a good way I can share the log? It’s too big to paste here.

The log can be found here: Ray monitor.log - c2db57a0

I thought about adding some exception handling around core_api().delete_namespaced_pod(node_id, self.namespace) in python/ray/autoscaler/_private/kubernetes/node_provider.py.

    def terminate_node(self, node_id):
        logger.info(log_prefix + "calling delete_namespaced_pod")
        core_api().delete_namespaced_pod(node_id, self.namespace)
        try:
            core_api().delete_namespaced_service(node_id, self.namespace)
        except ApiException:
            pass
        try:
            extensions_beta_api().delete_namespaced_ingress(
                node_id,
                self.namespace,
            )
        except ApiException:
            pass

Something like:

    def terminate_node(self, node_id):
        logger.info(log_prefix + "calling delete_namespaced_pod")
        try:
            core_api().delete_namespaced_pod(node_id, self.namespace)
        except ApiException as e:
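            # A 404 means the pod is already gone (e.g. its node was preempted),
            # so ignore it here and re-raise anything else.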
            if e.status != 404:
                raise

        try:
            core_api().delete_namespaced_service(node_id, self.namespace)
        except ApiException:
            pass
        try:
            extensions_beta_api().delete_namespaced_ingress(
                node_id,
                self.namespace,
            )
        except ApiException:
            pass

Would that make any sense, or is the main issue somewhere else?

I can also mention that running ray status gives the following when it’s in the failed state:

kubectl exec -it -n ray ray-head-vmvr9 -- bash
(base) ray@ray-head-vmvr9:~$ ray status
======== Autoscaler status: 2021-03-19 13:53:36.979562 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head_node
 1 node_pool_cpu
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/49.0 CPU
 0.00/178.458 GiB memory
 0.00/1.684 GiB object_store_memory

Demands:
 (no resource demands)
The autoscaler failed with the following error:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/monitor.py", line 274, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/monitor.py", line 177, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 154, in update
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 140, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 274, in _update
    self._get_node_type(node_id) + " (launch failed).",
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 601, in _get_node_type
    node_tags = self.provider.node_tags(node_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 65, in node_tags
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22785, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 22880, in read_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 239, in GET
    return self.request("GET", url,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'a0de5fc8-1ebe-41f2-8d7a-1de70ec65555', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 19 Mar 2021 19:14:08 GMT', 'Content-Length': '208'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"ray-worker-cpu-lrwzl\" not found","reason":"NotFound","details":{"name":"ray-worker-cpu-lrwzl","kind":"pods"},"code":404}

Thanks for the logs!
Adding some exception handling there might be useful. I’ll look into reproducing the problem and finding the root cause.

Thanks, I appreciate the help!

Scrolling through the logs, I see these lines occur in order:

2021-03-19 12:13:34,007 INFO autoscaler.py:198 -- StandardAutoscaler: ray-worker-cpu-lrwzl: Terminating idle node.

and then

2021-03-19 12:13:41,392 WARNING autoscaler.py:571 -- StandardAutoscaler: ray-worker-cpu-lrwzl: No recent heartbeat, restarting Ray to recover...

which leads to an exception.

So the autoscaler terminated the node and then for some reason tried to recover it.
Investigating why…

I’m able to reproduce the issue, so that’s a good start…
Tracking it here:


Nice! I’m hoping it will be straightforward to fix. :crossed_fingers:

There’s an issue in the way the autoscaler detects which Ray pods are active.
The linked PR fixes the problem and should be merged soon.
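
To give a rough idea of the kind of check involved (a hypothetical sketch, not the actual patch): the idea is to stop treating pods that have finished or are already being deleted as live nodes, e.g.:

# Hypothetical helper, for illustration only: list the Ray pods that are still live,
# skipping pods that have finished or are already being deleted.
# 'v1' is a kubernetes.client.CoreV1Api() instance.
def live_ray_pods(v1, namespace, label_selector):
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    return [
        pod for pod in pods.items
        if pod.status.phase not in ("Succeeded", "Failed")
        and pod.metadata.deletion_timestamp is None
    ]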


:clap:
Thanks! Looking forward to using the new version :smiley: