I have installed a Ray cluster on Kubernetes using `ray up`.
In one of my experiments the head node failed due to running out of memory. It looks like nothing is watching this node, because it stays in this failed state forever. Should the autoscaler watch the head node as well?
Also, `ray down` fails to clean up any pod that is not in the ready state; these pods have to be deleted manually.
Are these bugs?

Also, `ray down` does not remove the service associated with the head pod, and `ray up` sometimes creates this service but sometimes does not.
eoakes | March 26, 2021, 5:54pm | #3
            
@Dmitri, could you clarify the behavior here, please?
Dmitri | March 26, 2021, 6:27pm | #4
              
Currently, `ray down` does not delete the service, and `ray up` creates the service if it's not already present.
There's an open issue to change this behavior so that `ray down` deletes the service; a manual workaround is sketched after the issue excerpt below.
GitHub issue: opened 04:48AM - 16 Mar 21 UTC | closed 08:00PM - 19 Nov 22 UTC | labels: bug, P2, k8s
When creating a cluster on Kubernetes, Ray will allocate a service routing traffic to the head node when the user adds this to their cluster config:
```
services:
      - apiVersion: v1
        kind: Service
        metadata:
            name: local-cluster-ray-head
        spec:
            selector:
                component: local-cluster-ray-head
            ports:
                - name: client
                  protocol: TCP
                  port: 10001
                  targetPort: 10001
                - name: dashboard
                  protocol: TCP
                  port: 8265
                  targetPort: 8265
```
However, when calling `ray down cluster.yaml`, this service (unlike the pods) will not be removed. The expected behavior is that all resources created by `ray up` should be properly cleaned up after calling `ray down`.
cc @richardliaw 
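In the meantime, a manual workaround is to delete the leftover head service yourself after `ray down`. A minimal sketch, using the service name from the example config above (substitute your own service name and namespace):
```
# Manual cleanup of the head service that `ray down` leaves behind.
# The service name below comes from the example config above; adjust to yours.
kubectl get services
kubectl delete service local-cluster-ray-head
```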
Dmitri | March 26, 2021, 6:35pm | #5
blublinsky:

> I have installed a Ray cluster on Kubernetes using `ray up`.
> In one of my experiments the head node failed due to running out of memory. It looks like nothing is watching this node, because it stays in this failed state forever. Should the autoscaler watch the head node as well?
> Also, `ray down` fails to clean up any pod that is not in the ready state; these pods have to be deleted manually.
> Are these bugs?

`ray down` should remove all pods created by `ray up`, regardless of their status. If this didn't work for you, it would be great if you could file an issue on the Ray GitHub with bug reproduction details!
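As a stopgap, leftover pods can be removed by hand; a rough sketch, assuming the pods live in a `ray` namespace (check the actual names and labels with the first command):
```
# Rough stopgap: find and remove pods that `ray down` left behind.
kubectl -n ray get pods --show-labels
kubectl -n ray delete pod <leftover-pod-name> --grace-period=0 --force
```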
Unfortunately, we currently don't implement error handling to deal with Ray head failure. In fact, when launching clusters on Kubernetes with the cluster launcher (`ray up`), the autoscaler runs on the head node, so there's no way for the autoscaler to recover the head node.
In the future, the Ray K8s Operator will implement sensible logic to deal with head failure; this issue is tracked here:
GitHub issue: opened 06:21AM - 17 Mar 21 UTC | closed 03:45PM - 29 Apr 21 UTC | labels: bug, triage
  
    ### What is the problem?
- Ray version: 1.2.0
- OS version: Ubuntu 20.04.2 LTS (Focal Fossa)
- Python version: 3.7.7
### Reproduction (REQUIRED)
Example cluster config for ray operator:
```
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  namespace: ray
  name: kf-prod01
spec:
  # The maximum number of workers nodes to launch in addition to the head node.
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: head-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 0
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 0
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: ray-head-
      spec:
        restartPolicy: Never
        nodeSelector:
          cloud.google.com/gke-nodepool: default-pool
        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp which cause slowdowns if is not a shared memory volume.
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray:1.2.0
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ['trap : TERM INT; sleep infinity & wait;']
          ports:
          - containerPort: 6379  # Redis port
          - containerPort: 10001  # Used by Ray Client
          - containerPort: 12345 # Ray internal communication.
          - containerPort: 12346 # Ray internal communication.
          - containerPort: 8265  # Used by Ray Dashboard
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp which cause slowdowns if is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 2000m
              memory: 4Gi
            limits:
              cpu: 2000m
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 4Gi
    setupCommands:
     - pip install pipdate==0.5.2
  - name: worker-node
    # Minimum number of Ray workers of this Pod type.
    minWorkers: 2
    # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
    maxWorkers: 3
    # User-specified custom resources for use by Ray.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    # rayResources: {"foo": 1, "bar": 1}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        # Automatically generates a name for the pod with this prefix.
        generateName: ray-worker-
      spec:
        restartPolicy: Never
        nodeSelector:
          cloud.google.com/gke-nodepool: cpu-worker-pool01
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: rayproject/ray:1.2.0
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp which cause slowdowns if is not a shared memory volume.
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 2000m
              memory: 4Gi
            limits:
              cpu: 2000m
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 4Gi
    setupCommands:
      - pip install --ignore-installed ruamel.yaml==0.16.12
      - pip install --no-cache-dir torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0
      - pip install --no-cache-dir scipy==1.6.0 deprecated==1.2.11 gsutil==4.59 pytorch-lightning==1.1.8 ray[tune]==1.2.0
      #- pip install rollsroyce-0.0.1.post0.dev141+g301b4af-py2.py3-none-any.whl
      #- pip install --no-cache-dir pytorch-lightning==1.1.8
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 --object-manager-port=12345 --node-manager-port=12346 --ray-client-server-port 10001
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=12345 --node-manager-port=12346
---
apiVersion: v1
kind: Service
metadata:
  namespace: ray
  name: ray-head
spec:
  # This selector must match the head node pod's selector.
  selector:
    component: kf-prod01-ray-head
  ports:
  - name: client
    protocol: TCP
    port: 10001
    targetPort: 10001
  # Redis ports.
  - name: redis-primary
    port: 6379
    targetPort: 6379
  # Ray internal communication ports.
  - name: object-manager
    protocol: TCP
    port: 12345
    targetPort: 12345
  - name: node-manager
    protocol: TCP
    port: 12346
    targetPort: 12346
---
apiVersion: v1
kind: Service
metadata:
  namespace: ray
  name: ray-dashboard
spec:
  type: LoadBalancer
  # This selector must match the head node pod's selector.
  selector:
    component: kf-prod01-ray-head
  ports:
  - name: dashboard
    protocol: TCP
    port: 8265
    targetPort: 8265
```
Apply the cluster config by:
```
kubectl -n ray apply -f cluster.yaml
```
Then check the translated cluster config inside the ray operator:
```
(base) ray@ray-operator-pod:~$ cat ~/ray_cluster_configs/kf-prod01_config.yaml
auth: {}
available_node_types:
  head-node:
    max_workers: 0
    min_workers: 0
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-head-
        labels:
          component: kf-prod01-ray-head
        ownerReferences:
        - &id001
          apiVersion: cluster.ray.io/v1
          blockOwnerDeletion: true
          controller: true
          kind: RayCluster
          name: kf-prod01
          uid: e21f7f17-6ade-4f50-a3cc-ba09651e6047
      spec:
        containers:
        - args:
          - 'trap : TERM INT; sleep infinity & wait;'
          command:
          - /bin/bash
          - -c
          - --
          image: rayproject/ray:1.2.0
          imagePullPolicy: Always
          name: ray-node
          ports:
          - containerPort: 6379
            protocol: TCP
          - containerPort: 10001
            protocol: TCP
          - containerPort: 12345
            protocol: TCP
          - containerPort: 12346
            protocol: TCP
          - containerPort: 8265
            protocol: TCP
          resources:
            limits:
              cpu: 2000m
              memory: 4Gi
            requests:
              cpu: 2000m
              memory: 4Gi
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        nodeSelector:
          cloud.google.com/gke-nodepool: default-pool
        restartPolicy: Never
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
    resources:
      CPU: 2
    worker_setup_commands:
    - pip install pipdate==0.5.2
  worker-node:
    max_workers: 3
    min_workers: 2
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-worker-
        ownerReferences:
        - *id001
      spec:
        containers:
        - args:
          - 'trap : TERM INT; sleep infinity & wait;'
          command:
          - /bin/bash
          - -c
          - --
          image: rayproject/ray:1.2.0
          imagePullPolicy: Always
          name: ray-node
          resources:
            limits:
              cpu: 2000m
              memory: 4Gi
            requests:
              cpu: 2000m
              memory: 4Gi
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        nodeSelector:
          cloud.google.com/gke-nodepool: cpu-worker-pool01
        restartPolicy: Never
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
    resources:
      CPU: 2
    worker_setup_commands:
    - pip install --ignore-installed ruamel.yaml==0.16.12
    - pip install --no-cache-dir torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0
    - pip install --no-cache-dir scipy==1.6.0 deprecated==1.2.11 gsutil==4.59 pytorch-lightning==1.1.8
      ray[tune]==1.2.0
cluster_name: kf-prod01
cluster_synced_files: []
file_mounts: {}
file_mounts_sync_continuously: false
head_node: {}
head_node_type: head-node
head_setup_commands: []
head_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 --object-manager-port=12345
  --node-manager-port=12346 --ray-client-server-port 10001
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 3
provider:
  namespace: ray
  services:
  - apiVersion: v1
    kind: Service
    metadata:
      name: kf-prod01-ray-head
      namespace: ray
    spec:
      ports:
      - name: client
        port: 10001
        protocol: TCP
        targetPort: 10001
      - name: dashboard
        port: 8265
        protocol: TCP
        targetPort: 8265
      selector:
        component: kf-prod01-ray-head
  type: kubernetes
  use_internal_ips: true
setup_commands: []
upscaling_speed: 1
worker_nodes: {}
worker_setup_commands: []
worker_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=12345
  --node-manager-port=12346
```
It looks like the `setupCommands` in the operator config are translated into `worker_setup_commands` for the head node and the worker node, as well as the cluster-level `worker_setup_commands`.
It looks like the problem is [here](https://github.com/ray-project/ray/blob/ray-1.2.0/python/ray/operator/operator_utils.py#L30): the same config map is used for both the head node and the worker nodes.
The Ray operator runs the autoscaler in a pod separate from the Ray cluster.
Here’s the documentation on the cluster launcher and operator.
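A quick way to see this split, assuming the operator and the cluster both live in the `ray` namespace as in the example above:
```
# The operator/autoscaler pod should show up alongside, but separate from,
# the ray-head-* and ray-worker-* pods it manages.
kubectl -n ray get pods
```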
            
            
Did you guys consider installing the head node as a Deployment with a single replica, so that the Deployment can restart it in case of failure?
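Something along these lines is what I have in mind; just a rough sketch reusing the head container settings from the configs above (names and labels are illustrative, and the Ray start commands would still need to be run inside the container):
```
# Rough sketch (illustrative names/labels): run the head pod under a single-replica
# Deployment so Kubernetes recreates it if it dies, e.g. after an OOM kill.
cat <<'EOF' | kubectl -n ray apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-head
spec:
  replicas: 1
  selector:
    matchLabels:
      component: ray-head
  template:
    metadata:
      labels:
        component: ray-head
    spec:
      containers:
      - name: ray-node
        image: rayproject/ray:1.2.0
        command: ["/bin/bash", "-c", "--"]
        args: ["trap : TERM INT; sleep infinity & wait;"]
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
EOF
```
The existing head service would then keep routing to whichever replacement pod the Deployment creates, as long as its selector matches the pod labels.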