I have installed a Ray cluster on Kubernetes using `ray up`.
In one of my experiments the head node failed due to an out-of-memory error. It looks like nothing is watching this node, because it keeps sitting in this state forever. Should the autoscaler watch this node as well?
Also, `ray down` fails to clean up any pod that is not in the Ready state. These pods have to be deleted manually.
Are those bugs?
Also, `ray down` does not remove the service associated with the head pod, and `ray up` sometimes creates this service and sometimes does not.
eoakes · March 26, 2021, 5:54pm · #3
@Dmitri could you clarify the behavior here please?
Dmitri · March 26, 2021, 6:27pm · #4
Currently, `ray down` does not delete the service, and `ray up` creates the service only if it's not already present.
There's an issue open to change this behavior so that `ray down` deletes the service (a manual cleanup sketch follows the issue below):
opened 04:48AM - 16 Mar 21 UTC · closed 08:00PM - 19 Nov 22 UTC · labels: bug, P2, k8s
When creating a cluster on Kubernetes, Ray will allocate a service routing traffic to the head node when the user adds this to their cluster config:
```
services:
- apiVersion: v1
kind: Service
metadata:
name: local-cluster-ray-head
spec:
selector:
component: local-cluster-ray-head
ports:
- name: client
protocol: TCP
port: 10001
targetPort: 10001
- name: dashboard
protocol: TCP
port: 8265
targetPort: 8265
```
However, when calling `ray down cluster.yaml`, this service (unlike the pods) will not be removed. The expected behavior is that all resources created by `ray up` should be properly cleaned up after calling `ray down`.
cc @richardliaw
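Until that issue is resolved, the leftover head service has to be removed by hand after `ray down`. A minimal sketch, assuming the service is named as in the example above; substitute your own service name and namespace:

```
# List services to find the one created for the Ray head, then delete it.
# The name `local-cluster-ray-head` comes from the example config above;
# adjust the name and namespace to whatever your cluster config defines.
kubectl -n <namespace> get services
kubectl -n <namespace> delete service local-cluster-ray-head
```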
Dmitri · March 26, 2021, 6:35pm · #5
blublinsky:
I have installed a Ray cluster on Kubernetes using `ray up`.
In one of my experiments the head node failed due to an out-of-memory error. It looks like nothing is watching this node, because it keeps sitting in this state forever. Should the autoscaler watch this node as well?
Also, `ray down` fails to clean up any pod that is not in the Ready state. These pods have to be deleted manually.
Are those bugs?
`ray down` should remove all pods created by `ray up`, regardless of their status. If this didn't work for you, it would be great if you could file an issue on the Ray GitHub with bug reproduction details!
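As a stopgap, stuck pods can be removed by hand. A rough sketch follows; the `ray-cluster-name` label is an assumption about how the launcher tags its pods, so check your pods' labels with `--show-labels` first:

```
# Show all pods, including ones stuck in a non-Ready state, with their labels.
kubectl -n <namespace> get pods --show-labels

# Delete a straggler by name...
kubectl -n <namespace> delete pod <stuck-pod-name>

# ...or, if the pods carry a cluster label (verify in the output above),
# delete all of them at once by selector.
kubectl -n <namespace> delete pods -l ray-cluster-name=<cluster-name>
```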
Unfortunately, we currently don't implement error handling to deal with Ray head failure. In fact, when launching clusters on Kubernetes with the cluster launcher (`ray up`), the autoscaler runs on the head node, so there's no way for the autoscaler to recover the head node.
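For now, confirming that the head was actually OOM-killed is a manual check. A minimal sketch using the pod status (pod name and namespace are placeholders):

```
# An OOM-killed container reports reason: OOMKilled under
# lastState.terminated in the pod's status.
kubectl -n <namespace> describe pod <head-pod-name>
kubectl -n <namespace> get pod <head-pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```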
In the future, the Ray K8s Operator will implement sensible logic to deal with head failure; this issue is tracked here:
opened 06:21AM - 17 Mar 21 UTC · closed 03:45PM - 29 Apr 21 UTC · labels: bug, triage
### What is the problem?
- Ray version: 1.2.0
- OS version: Ubuntu 20.04.2 LTS (Focal Fossa)
- Python version: 3.7.7
### Reproduction (REQUIRED)
Example cluster config for ray operator:
```
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
namespace: ray
name: kf-prod01
spec:
# The maximum number of workers nodes to launch in addition to the head node.
maxWorkers: 3
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscalingSpeed: 1.0
# If a node is idle for this many minutes, it will be removed.
idleTimeoutMinutes: 5
# Specify the pod type for the ray head node (as configured below).
headPodType: head-node
# Specify the allowed pod types for this ray cluster and the resources they provide.
podTypes:
- name: head-node
# Minimum number of Ray workers of this Pod type.
minWorkers: 0
# Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
maxWorkers: 0
podConfig:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-head-
spec:
restartPolicy: Never
nodeSelector:
cloud.google.com/gke-nodepool: default-pool
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp which cause slowdowns if is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
containers:
- name: ray-node
imagePullPolicy: Always
image: rayproject/ray:1.2.0
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ['trap : TERM INT; sleep infinity & wait;']
ports:
- containerPort: 6379 # Redis port
- containerPort: 10001 # Used by Ray Client
- containerPort: 12345 # Ray internal communication.
- containerPort: 12346 # Ray internal communication.
- containerPort: 8265 # Used by Ray Dashboard
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp which cause slowdowns if is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 2000m
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
# redis, 30% for the shared memory object store, and the
# rest for application memory. If this limit is not set and
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
memory: 4Gi
setupCommands:
- pip install pipdate==0.5.2
- name: worker-node
# Minimum number of Ray workers of this Pod type.
minWorkers: 2
# Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
maxWorkers: 3
# User-specified custom resources for use by Ray.
# (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
# rayResources: {"foo": 1, "bar": 1}
podConfig:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-worker-
spec:
restartPolicy: Never
nodeSelector:
cloud.google.com/gke-nodepool: cpu-worker-pool01
volumes:
- name: dshm
emptyDir:
medium: Memory
containers:
- name: ray-node
imagePullPolicy: Always
image: rayproject/ray:1.2.0
command: ["/bin/bash", "-c", "--"]
args: ["trap : TERM INT; sleep infinity & wait;"]
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp which cause slowdowns if is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 2000m
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
# redis, 30% for the shared memory object store, and the
# rest for application memory. If this limit is not set and
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
memory: 4Gi
setupCommands:
- pip install --ignore-installed ruamel.yaml==0.16.12
- pip install --no-cache-dir torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0
- pip install --no-cache-dir scipy==1.6.0 deprecated==1.2.11 gsutil==4.59 pytorch-lightning==1.1.8 ray[tune]==1.2.0
#- pip install rollsroyce-0.0.1.post0.dev141+g301b4af-py2.py3-none-any.whl
#- pip install --no-cache-dir pytorch-lightning==1.1.8
# Commands to start Ray on the head node. You don't need to change this.
# Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
headStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 --object-manager-port=12345 --node-manager-port=12346 --ray-client-server-port 10001
# Commands to start Ray on worker nodes. You don't need to change this.
workerStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=12345 --node-manager-port=12346
---
apiVersion: v1
kind: Service
metadata:
namespace: ray
name: ray-head
spec:
# This selector must match the head node pod's selector.
selector:
component: kf-prod01-ray-head
ports:
- name: client
protocol: TCP
port: 10001
targetPort: 10001
# Redis ports.
- name: redis-primary
port: 6379
targetPort: 6379
# Ray internal communication ports.
- name: object-manager
protocol: TCP
port: 12345
targetPort: 12345
- name: node-manager
protocol: TCP
port: 12346
targetPort: 12346
---
apiVersion: v1
kind: Service
metadata:
namespace: ray
name: ray-dashboard
spec:
type: LoadBalancer
# This selector must match the head node pod's selector.
selector:
component: kf-prod01-ray-head
ports:
- name: dashboard
protocol: TCP
port: 8265
targetPort: 8265
```
Apply the cluster config by:
```
kubectl -n ray apply -f cluster.yaml
```
Then check the translated cluster config inside the ray operator:
```
(base) ray@ray-operator-pod:~$ cat ~/ray_cluster_configs/kf-prod01_config.yaml
auth: {}
available_node_types:
head-node:
max_workers: 0
min_workers: 0
node_config:
apiVersion: v1
kind: Pod
metadata:
generateName: ray-head-
labels:
component: kf-prod01-ray-head
ownerReferences:
- &id001
apiVersion: cluster.ray.io/v1
blockOwnerDeletion: true
controller: true
kind: RayCluster
name: kf-prod01
uid: e21f7f17-6ade-4f50-a3cc-ba09651e6047
spec:
containers:
- args:
- 'trap : TERM INT; sleep infinity & wait;'
command:
- /bin/bash
- -c
- --
image: rayproject/ray:1.2.0
imagePullPolicy: Always
name: ray-node
ports:
- containerPort: 6379
protocol: TCP
- containerPort: 10001
protocol: TCP
- containerPort: 12345
protocol: TCP
- containerPort: 12346
protocol: TCP
- containerPort: 8265
protocol: TCP
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 2000m
memory: 4Gi
volumeMounts:
- mountPath: /dev/shm
name: dshm
nodeSelector:
cloud.google.com/gke-nodepool: default-pool
restartPolicy: Never
volumes:
- emptyDir:
medium: Memory
name: dshm
resources:
CPU: 2
worker_setup_commands:
- pip install pipdate==0.5.2
worker-node:
max_workers: 3
min_workers: 2
node_config:
apiVersion: v1
kind: Pod
metadata:
generateName: ray-worker-
ownerReferences:
- *id001
spec:
containers:
- args:
- 'trap : TERM INT; sleep infinity & wait;'
command:
- /bin/bash
- -c
- --
image: rayproject/ray:1.2.0
imagePullPolicy: Always
name: ray-node
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 2000m
memory: 4Gi
volumeMounts:
- mountPath: /dev/shm
name: dshm
nodeSelector:
cloud.google.com/gke-nodepool: cpu-worker-pool01
restartPolicy: Never
volumes:
- emptyDir:
medium: Memory
name: dshm
resources:
CPU: 2
worker_setup_commands:
- pip install --ignore-installed ruamel.yaml==0.16.12
- pip install --no-cache-dir torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0
- pip install --no-cache-dir scipy==1.6.0 deprecated==1.2.11 gsutil==4.59 pytorch-lightning==1.1.8
ray[tune]==1.2.0
cluster_name: kf-prod01
cluster_synced_files: []
file_mounts: {}
file_mounts_sync_continuously: false
head_node: {}
head_node_type: head-node
head_setup_commands: []
head_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 --object-manager-port=12345
--node-manager-port=12346 --ray-client-server-port 10001
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 3
provider:
namespace: ray
services:
- apiVersion: v1
kind: Service
metadata:
name: kf-prod01-ray-head
namespace: ray
spec:
ports:
- name: client
port: 10001
protocol: TCP
targetPort: 10001
- name: dashboard
port: 8265
protocol: TCP
targetPort: 8265
selector:
component: kf-prod01-ray-head
type: kubernetes
use_internal_ips: true
setup_commands: []
upscaling_speed: 1
worker_nodes: {}
worker_setup_commands: []
worker_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=12345
--node-manager-port=12346
```
Looks like the `setupCommands` in the operator config is translated into `worker_setup_commands` for the head node, the worker node, as well as the cluster-level `worker_setup_commands`.
Looks like the problem is [here](https://github.com/ray-project/ray/blob/ray-1.2.0/python/ray/operator/operator_utils.py#L30): the same config map is used for both the head node and the worker node.
The Ray operator runs the autoscaler in a pod separate from the Ray cluster.
Here’s the documentation on the cluster launcher and operator.
Did you consider installing the head node as a Deployment with a single replica, so that the Deployment can restart it in the case of failures?
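For example, something along these lines. This is only a rough sketch of the idea, not a configuration the cluster launcher or operator produces today; the names, labels, image tag, and resource numbers are placeholders, and a restarted head still loses all in-memory cluster state:

```
# Hypothetical sketch: wrap the head pod spec in a Deployment with one
# replica so Kubernetes recreates it after an OOM kill. Not a supported
# Ray configuration; all names and values below are placeholders.
kubectl -n ray apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-head
spec:
  replicas: 1
  selector:
    matchLabels:
      component: ray-head
  template:
    metadata:
      labels:
        component: ray-head
    spec:
      containers:
      - name: ray-node
        image: rayproject/ray:1.2.0
        command: ["/bin/bash", "-c", "--"]
        # --block keeps `ray start` in the foreground so the container exits
        # (and the Deployment restarts it) if the Ray head process dies.
        args: ["ulimit -n 65536; ray start --head --dashboard-host 0.0.0.0 --block"]
        ports:
        - containerPort: 6379
        - containerPort: 10001
        - containerPort: 8265
        resources:
          requests:
            cpu: 2000m
            memory: 4Gi
          limits:
            cpu: 2000m
            memory: 4Gi
EOF
```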