How severely does this issue affect your experience of using Ray?
Medium: It blocks me from cluster provisioning.
I want to know whether any authentication mechanism is available when connecting to the head node using the Python Ray client. I would also like to know how Ray secures communication between processes and workers.
Note: We are planning to use an on-premise Ray cluster. Kindly suggest a solution for that setup.
TIA,
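For context, Ray's TLS support on its gRPC channels is configured through the environment variables `RAY_USE_TLS`, `RAY_TLS_SERVER_CERT`, `RAY_TLS_SERVER_KEY`, and `RAY_TLS_CA_CERT`, which are typically set the same way on the head node, the workers, and the client. The following is only a minimal sketch of preparing them before a Python client connects. The helper names `ray_tls_env` and `missing_files`, the file paths, and the `<head-host>` placeholder are assumptions for illustration and are not part of Ray.

```python
import os
from pathlib import Path


def ray_tls_env(cert: str, key: str, ca: str) -> dict:
    """Build the environment variables that ask Ray to use TLS on its gRPC channels."""
    return {
        "RAY_USE_TLS": "1",
        "RAY_TLS_SERVER_CERT": cert,
        "RAY_TLS_SERVER_KEY": key,
        "RAY_TLS_CA_CERT": ca,
    }


def missing_files(env: dict) -> list:
    """Return the configured certificate/key paths that do not exist on this machine."""
    return [
        path
        for name, path in env.items()
        if name != "RAY_USE_TLS" and not Path(path).is_file()
    ]


# Hypothetical locations; replace them with wherever your certificates live.
env = ray_tls_env("/etc/ray/tls/tls.crt", "/etc/ray/tls/tls.key", "/etc/ray/tls/ca.crt")
absent = missing_files(env)
if absent:
    print("Not enabling TLS, missing files:", absent)
else:
    # The variables must be in place before Ray is started or connected to.
    os.environ.update(env)
    # import ray
    # ray.init("ray://<head-host>:10001")  # <head-host> is a placeholder
```

The same variables would be exported in the shell before running `ray start` on each on-premise node, so that all processes present certificates signed by the same CA.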
Hi @Jules_Damji, could anyone help me out with this?
Thanks for pointing this out, @Jules_Damji. The linked page explains TLS configuration for a Kubernetes cluster. Could you guide me on achieving the same for an on-premise cluster? Is there any documentation?
cc: @Chen_Shen, do we have a doc guide on how to use SSL for Ray? The TLS Authentication guide we have is for Kubernetes.
Thanks for the prompt responses, @Jules_Damji.
@Chen_Shen, could you share the documentation for configuring TLS in an on-premise cluster?
dnd
January 8, 2025, 4:24pm
8
Hi @Chen_Shen, I am also trying to set up TLS authentication for an on-premise Ray cluster and have run into some issues. The doc seems to be written for a Ray head deployed as a Kubernetes Service, not for a RayCluster. I tried to adapt it in a similar manner (create a Secret, mount a ConfigMap), but I keep getting this error: "2025-01-08 08:01:21,454 C 44 44] (gcs_server) grpc_server.cc:124: Check failed: server_ Failed to start the grpc server" ... "port is already used by other processes" ... Do you have a hint for me?
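Since the message mentions a port that is already in use, one quick thing to rule out is whether something else really holds the port. The sketch below is a generic check, not a Ray API: it assumes it is run in the same network namespace as the Ray process (for example inside the head pod), `port_is_free` is a hypothetical helper name, and 6379 is simply the GCS port used in this thread.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket already owns this address.
            return False
        return True


# 6379 is the port passed to `ray start` in this thread.
print("port 6379 free:", port_is_free(6379))
```

If the port turns out to be free before Ray starts, the failure to start the gRPC server may have a different cause than a genuine port clash, so the rest of the log is worth reading as well.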
dnd
January 8, 2025, 4:30pm
9
Here is my YAML file:
apiVersion: v1
kind: Secret
metadata:
  name: ca-tls
data:
  # output from cat ca.crt | base64
  ca.crt: |
    ...
  # output from cat ca.key | base64
  ca.key: |
    ...
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: tls
data:
  gencert_head.sh: |
    #!/bin/sh
    ## Create tls.key
    openssl genrsa -out /etc/ray/tls/tls.key 2048
    ## Write CSR Config
    cat > /etc/ray/tls/csr.conf <<EOF
    [ req ]
    default_bits = 2048
    prompt = no
    default_md = sha256
    req_extensions = req_ext
    distinguished_name = dn
    [ dn ]
    C = US
    ST = California
    L = San Francisco
    O = ray
    OU = ray
    CN = *.ray.io
    [ req_ext ]
    subjectAltName = @alt_names
    [ alt_names ]
    DNS.1 = localhost
    DNS.2 = duy-raycluster-head-svc.az-cp-launch.svc.cluster.local
    IP.1 = 127.0.0.1
    IP.2 = $POD_IP
    EOF
    ## Create CSR using tls.key
    openssl req -new -key /etc/ray/tls/tls.key -out /etc/ray/tls/ca.csr -config /etc/ray/tls/csr.conf
    ## Write cert config
    cat > /etc/ray/tls/cert.conf <<EOF
    authorityKeyIdentifier=keyid,issuer
    basicConstraints=CA:FALSE
    keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
    subjectAltName = @alt_names
    [alt_names]
    DNS.1 = localhost
    DNS.2 = duy-raycluster-head-svc.az-cp-launch.svc.cluster.local
    IP.1 = 127.0.0.1
    IP.2 = $POD_IP
    EOF
    ## Generate tls.cert
    openssl x509 -req \
      -in /etc/ray/tls/ca.csr \
      -CA /etc/ray/tls/ca.crt -CAkey /etc/ray/tls/ca.key \
      -CAcreateserial -out /etc/ray/tls/tls.crt \
      -days 365 \
      -sha256 -extfile /etc/ray/tls/cert.conf
  gencert_worker.sh: |
    #!/bin/sh
    ## Create tls.key
    openssl genrsa -out /etc/ray/tls/tls.key 2048
    ## Write CSR Config
    cat > /etc/ray/tls/csr.conf <<EOF
    [ req ]
    default_bits = 2048
    prompt = no
    default_md = sha256
    req_extensions = req_ext
    distinguished_name = dn
    [ dn ]
    C = US
    ST = California
    L = San Francisco
    O = ray
    OU = ray
    CN = *.ray.io
    [ req_ext ]
    subjectAltName = @alt_names
    [ alt_names ]
    DNS.1 = localhost
    IP.1 = 127.0.0.1
    IP.2 = $POD_IP
    EOF
    ## Create CSR using tls.key
    openssl req -new -key /etc/ray/tls/tls.key -out /etc/ray/tls/ca.csr -config /etc/ray/tls/csr.conf
    ## Write cert config
    cat > /etc/ray/tls/cert.conf <<EOF
    authorityKeyIdentifier=keyid,issuer
    basicConstraints=CA:FALSE
    keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
    subjectAltName = @alt_names
    [alt_names]
    DNS.1 = localhost
    IP.1 = 127.0.0.1
    IP.2 = $POD_IP
    EOF
    ## Generate tls.cert
    openssl x509 -req \
      -in /etc/ray/tls/ca.csr \
      -CA /etc/ray/tls/ca.crt -CAkey /etc/ray/tls/ca.key \
      -CAcreateserial -out /etc/ray/tls/tls.crt \
      -days 365 \
      -sha256 -extfile /etc/ray/tls/cert.conf
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: duy-raycluster
spec:
  rayVersion: '2.40.0'
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
      port: "6379" # Redis port for Ray.
      dashboard-port: "8265" # Dashboard port.
      ray-client-server-port: "10001" # Client port.
    template:
      metadata:
        labels:
          sidecar.istio.io/inject: "false"
        name: duy-ray-head
      spec:
        serviceAccountName: default-editor
        nodeSelector:
          instance-type: g5.8xlarge
        tolerations:
          - effect: NoSchedule
            key: as_gpu_g5_8xlarge_ns
            operator: Equal
            value: "true"
          - effect: NoExecute
            key: as_gpu_g5_8xlarge_ne
            operator: Equal
            value: "true"
        initContainers:
          - name: ray-head-tls
            image: rayproject/ray:2.40.0-py311-gpu
            command: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_head.sh"]
            volumeMounts:
              - mountPath: /etc/ca/tls
                name: ca-tls
                readOnly: true
              - mountPath: /etc/ray/tls
                name: ray-tls
              - mountPath: /etc/gen/tls
                name: gen-tls-script
            env:
              - name: POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
        containers:
          - name: ray-head
            image: rayproject/ray:2.40.0-py311-gpu
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh","-c","ray stop"]
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
              - mountPath: /temp
                name: tmp
            resources:
              limits:
                cpu: "2"
                memory: "10G"
              requests:
                cpu: "1"
                memory: "8G"
            env:
              - name: RAY_USE_TLS
                value: "1"
              - name: RAY_TLS_SERVER_CERT
                value: "/etc/ray/tls/tls.crt"
              - name: RAY_TLS_SERVER_KEY
                value: "/etc/ray/tls/tls.key"
              - name: RAY_TLS_CA_CERT
                value: "/etc/ca/tls/ca.crt"
              - name: RAY_BACKEND_LOG_LEVEL
                value: warning
              - name: KUBERAY_GEN_RAY_START_CMD
                value: "ray start --head --port=6379 --gcs-server-port=6379 --num-cpus=4 --dashboard-host=0.0.0.0 --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
        volumes:
          - name: ca-tls
            secret:
              secretName: ca-tls
          - name: ray-tls
            emptyDir: {}
          # The gencert_head.sh can be prebaked into the docker container so the configMap is optional
          - name: gen-tls-script
            configMap:
              name: tls
              defaultMode: 0777
              # An array of keys from the ConfigMap to create as files
              items:
                - key: gencert_head.sh
                  path: gencert_head.sh
          - name: ray-logs
            emptyDir: {}
          - name: tmp
            emptyDir: {}
          - name: public-folder
            persistentVolumeClaim:
              claimName: public-folder
          - name: launch-cache
            persistentVolumeClaim:
              claimName: launch-cache
  workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 64
      groupName: gpu-group
      rayStartParams:
        num-cpus: '4'
      template:
        metadata:
          labels:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: default-editor
          nodeSelector:
            instance-type: g5.8xlarge
          tolerations:
            - effect: NoSchedule
              key: as_gpu_g5_8xlarge_ns
              operator: Equal
              value: "true"
            - effect: NoExecute
              key: as_gpu_g5_8xlarge_ne
              operator: Equal
              value: "true"
          initContainers:
            - name: ray-worker-tls
              image: rayproject/ray:2.40.0-py311-gpu
              command: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_worker.sh"]
              volumeMounts:
                - mountPath: /etc/ca/tls
                  name: ca-tls
                  readOnly: true
                - mountPath: /etc/ray/tls
                  name: ray-tls
                - mountPath: /etc/gen/tls
                  name: gen-tls-script
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
          containers:
            - name: ray-worker
              image: rayproject/ray:2.40.0-py311-gpu
              ports:
                - containerPort: 6379 # GCS server
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
              env:
                - name: KUBERAY_GEN_RAY_START_CMD
                  value: "ray start --memory=118111600640 --num-gpus=1 --num-cpus=4 --address=duy-raycluster-head-svc.az-cp-launch.svc.cluster.local:6379 --metrics-export-port=8080 --block --dashboard-agent-listen-port=52365"
                - name: RAY_USE_TLS
                  value: "1"
                - name: RAY_TLS_SERVER_CERT
                  value: "/etc/ray/tls/tls.crt"
                - name: RAY_TLS_SERVER_KEY
                  value: "/etc/ray/tls/tls.key"
                - name: RAY_TLS_CA_CERT
                  value: "/etc/ca/tls/ca.crt"
                - name: RAY_BACKEND_LOG_LEVEL
                  value: warning
                - name: RAY_memory_usage_threshold
                  value: "0.99" # Set threshold to 99%
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
                - mountPath: /temp
                  name: tmp
              resources:
                limits:
                  cpu: "27"
                  memory: "110Gi"
                  nvidia.com/gpu: 1
                requests:
                  cpu: "24"
                  memory: "100Gi"
                  nvidia.com/gpu: 1
              securityContext:
                runAsUser: 1000
                allowPrivilegeEscalation: true
                capabilities:
                  drop: ["ALL"]
                runAsNonRoot: true
                seccompProfile:
                  type: RuntimeDefault
          volumes:
            - name: ray-logs
              emptyDir: {}
            - name: tmp
              emptyDir: {}
            - name: launch-cache
              persistentVolumeClaim:
                claimName: launch-cache
            - name: ca-tls
              secret:
                secretName: ca-tls
            - name: ray-tls
              emptyDir: {}
            # The gencert_worker.sh can be prebaked into the docker container so the configMap is optional
            - name: gen-tls-script
              configMap:
                name: tls
                defaultMode: 0777
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: gencert_worker.sh
                    path: gencert_worker.sh
            - name: dshm
              emptyDir:
                medium: Memory # This specifies the volume will use RAM
                sizeLimit: 64Gi
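One thing that can be checked mechanically in a manifest like this is whether the paths given in the `RAY_TLS_*` variables are visible inside the container that reads them, since `volumeMounts` apply per container and a mount in an init container does not carry over to the main container. The sketch below is an illustration only: `unmounted_tls_paths` is a hypothetical helper, and the dictionary is a hand-copied subset of the `ray-head` container as posted. Files baked into the image would not show up in such a check, and whether this relates to the gRPC error is not established in this thread.

```python
from pathlib import PurePosixPath


def unmounted_tls_paths(container: dict) -> dict:
    """Return RAY_TLS_* variables whose path lies under none of the container's mountPaths."""
    mounts = [PurePosixPath(m["mountPath"]) for m in container.get("volumeMounts", [])]
    missing = {}
    for var in container.get("env", []):
        name, value = var.get("name", ""), var.get("value")
        if not name.startswith("RAY_TLS_") or value is None:
            continue
        path = PurePosixPath(value)
        # A path is covered when it sits at or below one of the mount points.
        if not any(path.is_relative_to(mount) for mount in mounts):
            missing[name] = value
    return missing


# Subset of the ray-head container from the manifest, copied by hand.
ray_head = {
    "name": "ray-head",
    "volumeMounts": [
        {"mountPath": "/tmp/ray", "name": "ray-logs"},
        {"mountPath": "/temp", "name": "tmp"},
    ],
    "env": [
        {"name": "RAY_USE_TLS", "value": "1"},
        {"name": "RAY_TLS_SERVER_CERT", "value": "/etc/ray/tls/tls.crt"},
        {"name": "RAY_TLS_SERVER_KEY", "value": "/etc/ray/tls/tls.key"},
        {"name": "RAY_TLS_CA_CERT", "value": "/etc/ca/tls/ca.crt"},
    ],
}

for name, path in unmounted_tls_paths(ray_head).items():
    print(f"{name} points at {path}, which no volumeMount in this container covers")
```

For the container as posted, all three certificate variables are reported, because only `/tmp/ray` and `/temp` are mounted there while `ca-tls` and `ray-tls` are mounted in the init container alone.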
dnd
January 9, 2025, 3:30pm
10