How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello! I have successfully launched a static Ray cluster on Kubernetes. For various reasons, my Kubernetes cluster does not allow the use of KubeRay, so I have resorted to using a static Ray cluster. Below is the deployment file for launching the cluster, more or less adapted from this file referenced in the documentation:
# This section is only required for deploying Redis on Kubernetes for the purpose of enabling Ray
# to write GCS metadata to an external Redis for fault tolerance. If you have already deployed Redis
# on Kubernetes, this section can be removed.
kind: ConfigMap
apiVersion: v1
metadata:
name: redis-config
labels:
app: redis
data:
redis.conf: |-
dir /data
port 6379
bind 0.0.0.0
appendonly yes
protected-mode no
pidfile /data/redis-6379.pid
---
apiVersion: v1
kind: Service
metadata:
name: redis
labels:
app: redis
spec:
type: ClusterIP
ports:
- name: redis
port: 6379
selector:
app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis
labels:
app: redis
spec:
replicas: 1
selector:
matchLabels:
app: redis
template:
metadata:
labels:
app: redis
spec:
containers:
- name: redis
image: redis:5.0.8
command:
- "sh"
- "-c"
- "redis-server /usr/local/etc/redis/redis.conf"
ports:
- containerPort: 6379
volumeMounts:
- name: config
mountPath: /usr/local/etc/redis/redis.conf
subPath: redis.conf
resources:
limits:
cpu: "1"
memory: "2G"
requests:
            # For production use cases, we recommend specifying integer CPU requests and limits.
            # We also recommend setting requests equal to limits for both CPU and memory.
            # For this example, we use a 500m CPU request to accommodate resource-constrained local
            # Kubernetes testing environments such as Kind and minikube.
cpu: "500m"
            # The rest-state memory usage of the Ray head node is around 1 GB. We do not
            # recommend allocating less than 2 GB of memory for the Ray head pod.
            # For production use cases, we recommend allocating at least 8 GB of memory for each Ray container.
memory: "2G"
volumes:
- name: config
configMap:
name: redis-config
---
# Ray head node service, allowing worker pods to discover the head node for bidirectional communication.
# More context can be found in [the Ports configurations doc](https://docs.ray.io/en/latest/ray-core/configure.html#ports-configurations).
apiVersion: v1
kind: Service
metadata:
name: service-ray-cluster
labels:
app: ray-cluster-head
spec:
clusterIP: None
ports:
- name: client
protocol: TCP
port: 10001
targetPort: 10001
- name: dashboard
protocol: TCP
port: 8265
targetPort: 8265
- name: gcs-server
protocol: TCP
port: 6380
targetPort: 6380
selector:
app: ray-cluster-head
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: deployment-ray-head
labels:
app: ray-cluster-head
spec:
# Do not change this - Ray currently only supports one head node per cluster.
replicas: 1
selector:
matchLabels:
component: ray-head
type: ray
app: ray-cluster-head
template:
metadata:
labels:
component: ray-head
type: ray
app: ray-cluster-head
spec:
# If the head node goes down, the entire cluster (including all worker
# nodes) will go down as well. If you want Kubernetes to bring up a new
      # head node in this case, set this to "Always"; otherwise set it to "Never".
restartPolicy: Always
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
      # /tmp, which causes slowdowns if it is not a shared memory volume.
volumes:
- name: dshm
emptyDir:
medium: Memory
containers:
- name: ray-head
image: rayproject/ray:2.8.0
imagePullPolicy: Always
command: [ "/bin/bash", "-c", "--" ]
# if there is no password for Redis, set --redis-password=''
args:
- "ray start --head --port=6380 --num-cpus=$MY_CPU_REQUEST --dashboard-host=0.0.0.0 --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --redis-password='' --block"
ports:
- containerPort: 6380 # GCS server
- containerPort: 10001 # Used by Ray Client
- containerPort: 8265 # Used by Ray Dashboard
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
        # /tmp, which causes slowdowns if it is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
env:
        # RAY_REDIS_ADDRESS lets Ray use an external Redis for fault tolerance.
- name: RAY_REDIS_ADDRESS
value: redis:6379 # ip address for the external Redis, which is "redis:6379" in this example
# This is used in the ray start command so that Ray can spawn the
# correct number of processes. Omitting this may lead to degraded
# performance.
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
resource: requests.cpu
resources:
limits:
cpu: "1"
memory: "2G"
requests:
            # For production use cases, we recommend specifying integer CPU requests and limits.
            # We also recommend setting requests equal to limits for both CPU and memory.
            # For this example, we use a 500m CPU request to accommodate resource-constrained local
# Kubernetes testing environments such as Kind and minikube.
cpu: "500m"
            # The rest-state memory usage of the Ray head node is around 1 GB. We do not
            # recommend allocating less than 2 GB of memory for the Ray head pod.
            # For production use cases, we recommend allocating at least 8 GB of memory for each Ray container.
memory: "2G"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: deployment-ray-worker
labels:
app: ray-cluster-worker
spec:
# Change this to scale the number of worker nodes started in the Ray cluster.
replicas: 2
selector:
matchLabels:
component: ray-worker
type: ray
app: ray-cluster-worker
template:
metadata:
labels:
component: ray-worker
type: ray
app: ray-cluster-worker
spec:
restartPolicy: Always
volumes:
- name: dshm
emptyDir:
medium: Memory
containers:
- name: ray-worker
image: rayproject/ray:2.8.0
imagePullPolicy: Always
command: ["/bin/bash", "-c", "--"]
args:
- "ray start --num-cpus=$MY_CPU_REQUEST --address=service-ray-cluster:6380 --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
        # /tmp, which causes slowdowns if it is not a shared memory volume.
volumeMounts:
- mountPath: /dev/shm
name: dshm
env:
# This is used in the ray start command so that Ray can spawn the
# correct number of processes. Omitting this may lead to degraded
# performance.
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
resource: requests.cpu
# The resource requests and limits in this config are too small for production!
# It is better to use a few large Ray pods than many small ones.
# For production, it is ideal to size each Ray pod to take up the
# entire Kubernetes node on which it is scheduled.
resources:
limits:
cpu: "1"
memory: "1G"
          # For production use cases, we recommend specifying integer CPU requests and limits.
          # We also recommend setting requests equal to limits for both CPU and memory.
          # For this example, we use a 500m CPU request to accommodate resource-constrained local
# Kubernetes testing environments such as Kind and minikube.
requests:
cpu: "500m"
memory: "1G"
In addition, I have added the following LoadBalancer service so that I can send requests to the cluster remotely via the Ray Client / job submission API:
---
apiVersion: v1
kind: Service
metadata:
  name: ray-external-service
namespace: bbhnet
spec:
ports:
- name: ray-head-node-service
port: 10001
protocol: TCP
targetPort: 10001
selector:
app: ray-cluster-head
type: LoadBalancer
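For context, the "submit API" route mentioned above would go through the dashboard HTTP port (8265), which is exposed on the head Service but not on this LoadBalancer. A rough sketch of what that path would look like, with a hypothetical placeholder for the dashboard address, is:

from ray.job_submission import JobSubmissionClient

# Assumes port 8265 is reachable from outside the cluster; the address
# below is a placeholder, not something from my actual setup.
client = JobSubmissionClient("http://{DASHBOARD_IP_ADDRESS}:8265")
job_id = client.submit_job(
    entrypoint="python -c \"import ray; ray.init(); print('hello from the cluster')\"",
)
print(client.get_job_status(job_id))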
Now, I want to be able to send remote requests to this cluster from a Python script by querying the load balancer's IP address (e.g. via kubectl get services) and running something like:
import ray

# Connect to the remote cluster through the LoadBalancer's external IP.
ray.init(address="{LOAD_BALANCER_IP_ADDRESS}:10001")

@ray.remote
def hello_world():
    return "Hello World"

print(ray.get(hello_world.remote()))
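(For reference, I understand Ray Client connections are usually written with an explicit ray:// scheme, so the connect call may need to be the variant below, though I am not sure whether that matters here; the IP is still just a placeholder.)

# Variant using the explicit Ray Client URI scheme (placeholder IP).
ray.init(address="ray://{LOAD_BALANCER_IP_ADDRESS}:10001")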
However, I keep getting ConnectionError: ray client connection timeout errors. I have tried the various ports, and nothing seems to work. I would be grateful for any help!