I have installed a Ray cluster on Kubernetes using `ray up`.
In one of my experiments the head node failed due to an out-of-memory error. It looks like nothing is watching this node, because it keeps sitting in this state forever. Should the autoscaler watch this node as well?
Also, `ray down` fails to clean up any pod that is not in the Ready state. These pods have to be deleted manually.
Are those bugs?
Also, `ray down` does not remove the service associated with the head pod, and `ray up` sometimes creates this service and sometimes does not.
eoakes · March 26, 2021, 5:54pm · #3
@Dmitri could you clarify the behavior here please?
Dmitri · March 26, 2021, 6:27pm · #4
Currently, `ray down` does not delete the service, and `ray up` creates the service only if it's not already present.
There's an issue open to change this behavior so that `ray down` deletes the service (a manual cleanup sketch follows the issue below):
opened 04:48AM - 16 Mar 21 UTC · closed 08:00PM - 19 Nov 22 UTC · labels: bug, P2, k8s
When creating a cluster on Kubernetes, Ray will allocate a service routing traffic to the head node when the user adds this to their cluster config:
```
services:
- apiVersion: v1
kind: Service
metadata:
name: local-cluster-ray-head
spec:
selector:
component: local-cluster-ray-head
ports:
- name: client
protocol: TCP
port: 10001
targetPort: 10001
- name: dashboard
protocol: TCP
port: 8265
targetPort: 8265
```
However, when calling `ray down cluster.yaml`, this service (unlike the pods) will not be removed. The expected behavior is that all resources created by `ray up` should be properly cleaned up after calling `ray down`.
cc @richardliaw
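Until that issue is resolved, the leftover head service has to be removed by hand after `ray down`. A minimal sketch, assuming the service is named as in the example above; substitute your own service name and namespace:

```
# List services to find the one created for the Ray head, then delete it.
# The name `local-cluster-ray-head` comes from the example config above;
# adjust the name and namespace to whatever your cluster config defines.
kubectl -n <namespace> get services
kubectl -n <namespace> delete service local-cluster-ray-head
```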
Dmitri · March 26, 2021, 6:35pm · #5
blublinsky:
I have installed a Ray cluster on Kubernetes using `ray up`.
In one of my experiments the head node failed due to an out-of-memory error. It looks like nothing is watching this node, because it keeps sitting in this state forever. Should the autoscaler watch this node as well?
Also, `ray down` fails to clean up any pod that is not in the Ready state. These pods have to be deleted manually.
Are those bugs?
`ray down` should remove all pods created by `ray up`, regardless of their status. If this didn't work for you, it would be great if you could file an issue on the Ray GitHub with bug reproduction details!
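As a stopgap, stuck pods can be removed by hand. A rough sketch follows; the `ray-cluster-name` label is an assumption about how the launcher tags its pods, so check your pods' labels with `--show-labels` first:

```
# Show all pods, including ones stuck in a non-Ready state, with their labels.
kubectl -n <namespace> get pods --show-labels

# Delete a straggler by name...
kubectl -n <namespace> delete pod <stuck-pod-name>

# ...or, if the pods carry a cluster label (verify in the output above),
# delete all of them at once by selector.
kubectl -n <namespace> delete pods -l ray-cluster-name=<cluster-name>
```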
Unfortunately, we currently don't implement error handling to deal with Ray head failure. In fact, when launching clusters on Kubernetes with the cluster launcher (`ray up`), the autoscaler runs on the head node, so there's no way for the autoscaler to recover the head node.
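For now, confirming that the head was actually OOM-killed is a manual check. A minimal sketch using the pod status (pod name and namespace are placeholders):

```
# An OOM-killed container reports reason: OOMKilled under
# lastState.terminated in the pod's status.
kubectl -n <namespace> describe pod <head-pod-name>
kubectl -n <namespace> get pod <head-pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```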
In the future, the Ray K8s Operator will implement sensible logic to deal with head failure; this issue is tracked here:
opened 06:21AM - 17 Mar 21 UTC · closed 03:45PM - 29 Apr 21 UTC · labels: bug, triage
### What is the problem?
- Ray version: 1.2.0
- OS version: Ubuntu 20.04.2 LTS (Focal Fossa)
- Python version: 3.7.7
### Reproduction (REQUIRED)
Example cluster config for ray operator:
```
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
namespace: ray
name: kf-prod01
spec:
# The maximum number of workers nodes to launch in addition to the head node.
maxWorkers: 3
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscalingSpeed: 1.0
# If a node is idle for this many minutes, it will be removed.
idleTimeoutMinutes: 5
# Specify the pod type for the ray head node (as configured below).
headPodType: head-node
# Specify the allowed pod types for this ray cluster and the resources they provide.
podTypes:
- name: head-node
# Minimum number of Ray workers of this Pod type.
minWorkers: 0
# Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
maxWorkers: 0
podConfig:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-head-
spec:
restartPolicy: Never
nodeSelector:
cloud.google.com/gke-nodepool: default-pool
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp which cause slowdowns if is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
containers:
- name: ray-node
imagePullPolicy: Always
image: rayproject/ray:1.2.0
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ['trap : TERM INT; sleep infinity & wait;']
ports:
- containerPort: 6379 # Redis port
- containerPort: 10001 # Used by Ray Client
- containerPort: 12345 # Ray internal communication.
- containerPort: 12346 # Ray internal communication.
- containerPort: 8265 # Used by Ray Dashboard
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp which cause slowdowns if is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 2000m
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
# redis, 30% for the shared memory object store, and the
# rest for application memory. If this limit is not set and
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
memory: 4Gi
setupCommands:
- pip install pipdate==0.5.2
- name: worker-node
# Minimum number of Ray workers of this Pod type.
minWorkers: 2
# Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
maxWorkers: 3
# User-specified custom resources for use by Ray.
# (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
# rayResources: {"foo": 1, "bar": 1}
podConfig:
apiVersion: v1
kind: Pod
metadata:
# Automatically generates a name for the pod with this prefix.
generateName: ray-worker-
spec:
restartPolicy: Never
nodeSelector:
cloud.google.com/gke-nodepool: cpu-worker-pool01
volumes:
- name: dshm
emptyDir:
medium: Memory
containers:
- name: ray-node
imagePullPolicy: Always
image: rayproject/ray:1.2.0
command: ["/bin/bash", "-c", "--"]
args: ["trap : TERM INT; sleep infinity & wait;"]
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp which cause slowdowns if is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 2000m
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
# redis, 30% for the shared memory object store, and the
# rest for application memory. If this limit is not set and
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
memory: 4Gi
setupCommands:
- pip install --ignore-installed ruamel.yaml==0.16.12
- pip install --no-cache-dir torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0
- pip install --no-cache-dir scipy==1.6.0 deprecated==1.2.11 gsutil==4.59 pytorch-lightning==1.1.8 ray[tune]==1.2.0
#- pip install rollsroyce-0.0.1.post0.dev141+g301b4af-py2.py3-none-any.whl
#- pip install --no-cache-dir pytorch-lightning==1.1.8
# Commands to start Ray on the head node. You don't need to change this.
# Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
headStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 --object-manager-port=12345 --node-manager-port=12346 --ray-client-server-port 10001
# Commands to start Ray on worker nodes. You don't need to change this.
workerStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=12345 --node-manager-port=12346
---
apiVersion: v1
kind: Service
metadata:
namespace: ray
name: ray-head
spec:
# This selector must match the head node pod's selector.
selector:
component: kf-prod01-ray-head
ports:
- name: client
protocol: TCP
port: 10001
targetPort: 10001
# Redis ports.
- name: redis-primary
port: 6379
targetPort: 6379
# Ray internal communication ports.
- name: object-manager
protocol: TCP
port: 12345
targetPort: 12345
- name: node-manager
protocol: TCP
port: 12346
targetPort: 12346
---
apiVersion: v1
kind: Service
metadata:
namespace: ray
name: ray-dashboard
spec:
type: LoadBalancer
# This selector must match the head node pod's selector.
selector:
component: kf-prod01-ray-head
ports:
- name: dashboard
protocol: TCP
port: 8265
targetPort: 8265
```
Apply the cluster config by:
```
kubectl -n ray apply -f cluster.yaml
```
Then check the translated cluster config inside the ray operator:
```
(base) ray@ray-operator-pod:~$ cat ~/ray_cluster_configs/kf-prod01_config.yaml
auth: {}
available_node_types:
head-node:
max_workers: 0
min_workers: 0
node_config:
apiVersion: v1
kind: Pod
metadata:
generateName: ray-head-
labels:
component: kf-prod01-ray-head
ownerReferences:
- &id001
apiVersion: cluster.ray.io/v1
blockOwnerDeletion: true
controller: true
kind: RayCluster
name: kf-prod01
uid: e21f7f17-6ade-4f50-a3cc-ba09651e6047
spec:
containers:
- args:
- 'trap : TERM INT; sleep infinity & wait;'
command:
- /bin/bash
- -c
- --
image: rayproject/ray:1.2.0
imagePullPolicy: Always
name: ray-node
ports:
- containerPort: 6379
protocol: TCP
- containerPort: 10001
protocol: TCP
- containerPort: 12345
protocol: TCP
- containerPort: 12346
protocol: TCP
- containerPort: 8265
protocol: TCP
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 2000m
memory: 4Gi
volumeMounts:
- mountPath: /dev/shm
name: dshm
nodeSelector:
cloud.google.com/gke-nodepool: default-pool
restartPolicy: Never
volumes:
- emptyDir:
medium: Memory
name: dshm
resources:
CPU: 2
worker_setup_commands:
- pip install pipdate==0.5.2
worker-node:
max_workers: 3
min_workers: 2
node_config:
apiVersion: v1
kind: Pod
metadata:
generateName: ray-worker-
ownerReferences:
- *id001
spec:
containers:
- args:
- 'trap : TERM INT; sleep infinity & wait;'
command:
- /bin/bash
- -c
- --
image: rayproject/ray:1.2.0
imagePullPolicy: Always
name: ray-node
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 2000m
memory: 4Gi
volumeMounts:
- mountPath: /dev/shm
name: dshm
nodeSelector:
cloud.google.com/gke-nodepool: cpu-worker-pool01
restartPolicy: Never
volumes:
- emptyDir:
medium: Memory
name: dshm
resources:
CPU: 2
worker_setup_commands:
- pip install --ignore-installed ruamel.yaml==0.16.12
- pip install --no-cache-dir torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0
- pip install --no-cache-dir scipy==1.6.0 deprecated==1.2.11 gsutil==4.59 pytorch-lightning==1.1.8
ray[tune]==1.2.0
cluster_name: kf-prod01
cluster_synced_files: []
file_mounts: {}
file_mounts_sync_continuously: false
head_node: {}
head_node_type: head-node
head_setup_commands: []
head_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 --object-manager-port=12345
--node-manager-port=12346 --ray-client-server-port 10001
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 3
provider:
namespace: ray
services:
- apiVersion: v1
kind: Service
metadata:
name: kf-prod01-ray-head
namespace: ray
spec:
ports:
- name: client
port: 10001
protocol: TCP
targetPort: 10001
- name: dashboard
port: 8265
protocol: TCP
targetPort: 8265
selector:
component: kf-prod01-ray-head
type: kubernetes
use_internal_ips: true
setup_commands: []
upscaling_speed: 1
worker_nodes: {}
worker_setup_commands: []
worker_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=12345
--node-manager-port=12346
```
Looks like the `setupCommands` in the operator config is translated into `worker_setup_commands` for the head node, the worker node, as well as the cluster-level `worker_setup_commands`.
Looks like the problem is [here](https://github.com/ray-project/ray/blob/ray-1.2.0/python/ray/operator/operator_utils.py#L30): the same config map is used for both the head node and the worker node.
The Ray operator runs the autoscaler in a pod separate from the Ray cluster.
Here’s the documentation on the cluster launcher and operator.
Did you consider installing the head node as a Deployment with a single replica, so that the Deployment can restart it in the case of failures?
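For example, something along these lines. This is only a rough sketch of the idea, not a configuration the cluster launcher or operator produces today; the names, labels, image tag, and resource numbers are placeholders, and a restarted head still loses all in-memory cluster state:

```
# Hypothetical sketch: wrap the head pod spec in a Deployment with one
# replica so Kubernetes recreates it after an OOM kill. Not a supported
# Ray configuration; all names and values below are placeholders.
kubectl -n ray apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-head
spec:
  replicas: 1
  selector:
    matchLabels:
      component: ray-head
  template:
    metadata:
      labels:
        component: ray-head
    spec:
      containers:
      - name: ray-node
        image: rayproject/ray:1.2.0
        command: ["/bin/bash", "-c", "--"]
        # --block keeps `ray start` in the foreground so the container exits
        # (and the Deployment restarts it) if the Ray head process dies.
        args: ["ulimit -n 65536; ray start --head --dashboard-host 0.0.0.0 --block"]
        ports:
        - containerPort: 6379
        - containerPort: 10001
        - containerPort: 8265
        resources:
          requests:
            cpu: 2000m
            memory: 4Gi
          limits:
            cpu: 2000m
            memory: 4Gi
EOF
```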