Protect communication in cluster

How severe does this issue affect your experience of using Ray?

  • Medium: It blocks me from cluster provisioning. :slightly_smiling_face:

I want to know whether any authentication mechanism is available when connecting to the head node with the Python Ray client. I would also like to know how Ray secures communication between processes and workers.

Note: We are planning to use an on-premise Ray cluster, so please suggest a solution for that setup.

TIA, :slightly_smiling_face:

Hi @Jules_Damji, could anyone help me out with this?

@BalajiSelvaraj10 Ray can be used with TLS authentication.

Thanks for pointing this out, @Jules_Damji. The linked page explains TLS configuration for a Kubernetes cluster. Could you guide me on achieving the same for an on-premise cluster? Is there any documentation?

cc: @Chen_Shen do we have a doc guide on how to use SSL for Ray? The TLS Authentication guide we have is for Kubernetes.

Thanks for the prompt responses, @Jules_Damji :slightly_smiling_face:

@Chen_Shen, could you share the documentation for configuring TLS in an on-premise cluster?

Hi @BalajiSelvaraj10,
The "Configuring Ray" page (Ray 2.6.1 docs) lists the env variables you can set.
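
For an on-premise cluster, the variables that appear later in this thread (`RAY_USE_TLS`, `RAY_TLS_SERVER_CERT`, `RAY_TLS_SERVER_KEY`, `RAY_TLS_CA_CERT`) would be exported on every node, and on the machine running the client, before Ray starts. The following is only a minimal sketch: the helper name `ray_tls_env` and the `/etc/ray/tls` directory are assumptions for illustration, and it presumes the certificate files already exist there.

```shell
#!/bin/sh
# Hypothetical helper: export Ray's TLS variables for the current shell.
# TLS_DIR is an assumed location; point it at wherever your certs live.
ray_tls_env() {
    TLS_DIR="${1:-/etc/ray/tls}"
    export RAY_USE_TLS=1
    export RAY_TLS_SERVER_CERT="$TLS_DIR/tls.crt"
    export RAY_TLS_SERVER_KEY="$TLS_DIR/tls.key"
    export RAY_TLS_CA_CERT="$TLS_DIR/ca.crt"
}

ray_tls_env /etc/ray/tls

# With the variables set, start Ray as usual on each machine:
#   head node:    ray start --head --port=6379
#   worker nodes: ray start --address=<head-ip>:6379
#   client:       run the Python script that calls ray.init("ray://<head-ip>:10001")
```

The `ray start` lines are left as comments so the snippet can be sourced safely; `<head-ip>` is a placeholder.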

Hi @Chen_Shen, I am trying to set up TLS authentication for an on-premise Ray cluster too and have run into some issues. The doc seems to be written for a Ray head defined as a k8s Service kind, not for a RayCluster. I tried to adapt it in a similar manner (create a Secret, mount a ConfigMap), but I keep getting the error "2025-01-08 08:01:21,454 C 44 44] (gcs_server) grpc_server.cc:124: Check failed: server_ Failed to start the grpc server" port is already used by other processes … Do you have a hint for me?

Here is my YAML file:

apiVersion: v1
kind: Secret
metadata:
  name: ca-tls
data:
  # output from cat ca.crt | base64
  ca.crt: |
    ...
  # output from cat ca.key | base64
  ca.key: |
    ...
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: tls
data:
  gencert_head.sh: |
    #!/bin/sh
    ## Create tls.key
    openssl genrsa -out /etc/ray/tls/tls.key 2048

    ## Write CSR Config
    cat > /etc/ray/tls/csr.conf <<EOF
    [ req ]
    default_bits = 2048
    prompt = no
    default_md = sha256
    req_extensions = req_ext
    distinguished_name = dn

    [ dn ]
    C = US
    ST = California
    L = San Francisco
    O = ray
    OU = ray
    CN = *.ray.io

    [ req_ext ]
    subjectAltName = @alt_names

    [ alt_names ]
    DNS.1 = localhost
    DNS.2 = duy-raycluster-head-svc.az-cp-launch.svc.cluster.local
    IP.1 = 127.0.0.1
    IP.2 = $POD_IP

    EOF

    ## Create CSR using tls.key
    openssl req -new -key /etc/ray/tls/tls.key -out /etc/ray/tls/ca.csr -config /etc/ray/tls/csr.conf

    ## Write cert config
    cat > /etc/ray/tls/cert.conf <<EOF

    authorityKeyIdentifier=keyid,issuer
    basicConstraints=CA:FALSE
    keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
    subjectAltName = @alt_names

    [alt_names]
    DNS.1 = localhost
    DNS.2 = duy-raycluster-head-svc.az-cp-launch.svc.cluster.local
    IP.1 = 127.0.0.1
    IP.2 = $POD_IP

    EOF

    ## Generate tls.cert
    openssl x509 -req \
        -in /etc/ray/tls/ca.csr \
        -CA /etc/ray/tls/ca.crt -CAkey /etc/ray/tls/ca.key \
        -CAcreateserial -out /etc/ray/tls/tls.crt \
        -days 365 \
        -sha256 -extfile /etc/ray/tls/cert.conf

  gencert_worker.sh: |
    #!/bin/sh
    ## Create tls.key
    openssl genrsa -out /etc/ray/tls/tls.key 2048

    ## Write CSR Config
    cat > /etc/ray/tls/csr.conf <<EOF
    [ req ]
    default_bits = 2048
    prompt = no
    default_md = sha256
    req_extensions = req_ext
    distinguished_name = dn

    [ dn ]
    C = US
    ST = California
    L = San Francisco
    O = ray
    OU = ray
    CN = *.ray.io

    [ req_ext ]
    subjectAltName = @alt_names

    [ alt_names ]
    DNS.1 = localhost
    IP.1 = 127.0.0.1
    IP.2 = $POD_IP

    EOF

    ## Create CSR using tls.key
    openssl req -new -key /etc/ray/tls/tls.key -out /etc/ray/tls/ca.csr -config /etc/ray/tls/csr.conf

    ## Write cert config
    cat > /etc/ray/tls/cert.conf <<EOF

    authorityKeyIdentifier=keyid,issuer
    basicConstraints=CA:FALSE
    keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
    subjectAltName = @alt_names

    [alt_names]
    DNS.1 = localhost
    IP.1 = 127.0.0.1
    IP.2 = $POD_IP

    EOF

    ## Generate tls.cert
    openssl x509 -req \
        -in /etc/ray/tls/ca.csr \
        -CA /etc/ray/tls/ca.crt -CAkey /etc/ray/tls/ca.key \
        -CAcreateserial -out /etc/ray/tls/tls.crt \
        -days 365 \
        -sha256 -extfile /etc/ray/tls/cert.conf
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: duy-raycluster
spec:
  rayVersion: '2.40.0'
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default 
    idleTimeoutSeconds: 60
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
      port: "6379"              # GCS server port (historically the Redis port).
      dashboard-port: "8265"    # Dashboard port.
      ray-client-server-port: "10001"      # Client port.
    template:
      metadata:
        labels:
          sidecar.istio.io/inject: "false"
        name: duy-ray-head
      spec:
        serviceAccountName: default-editor
        nodeSelector:
          instance-type: g5.8xlarge
        tolerations:
          - effect: NoSchedule
            key: as_gpu_g5_8xlarge_ns
            operator: Equal
            value: "true"
          - effect: NoExecute
            key: as_gpu_g5_8xlarge_ne
            operator: Equal
            value: "true"
        initContainers:
        - name: ray-head-tls
          image: rayproject/ray:2.40.0-py311-gpu
          command: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_head.sh"]
          volumeMounts:
            - mountPath: /etc/ca/tls
              name: ca-tls
              readOnly: true
            - mountPath: /etc/ray/tls
              name: ray-tls
            - mountPath: /etc/gen/tls
              name: gen-tls-script
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
        containers:
        - name: ray-head
          image: rayproject/ray:2.40.0-py311-gpu
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            - mountPath: /temp
              name: tmp
          resources:
            limits:
              cpu: "2"
              memory: "10G"
            requests:
              cpu: "1"
              memory: "8G"
          env:
            - name: RAY_USE_TLS
              value: "1"
            - name: RAY_TLS_SERVER_CERT
              value: "/etc/ray/tls/tls.crt"
            - name: RAY_TLS_SERVER_KEY
              value: "/etc/ray/tls/tls.key"
            - name: RAY_TLS_CA_CERT
              value: "/etc/ca/tls/ca.crt"
            - name: RAY_BACKEND_LOG_LEVEL
              value: warning
            - name: KUBERAY_GEN_RAY_START_CMD
              value: "ray start --head --port=6379 --gcs-server-port=6379 --num-cpus=4 --dashboard-host=0.0.0.0 --object-manager-port=8076 --node-manager-port=8077 --dashboard-agent-grpc-port=8078 --dashboard-agent-listen-port=52365 --block"
        volumes:
          - name: ca-tls
            secret:
              secretName: ca-tls
          - name: ray-tls
            emptyDir: {}
                # The gencert_head.sh can be prebaked into the docker container so the configMap is optional
          - name: gen-tls-script
            configMap:
              name: tls
              defaultMode: 0777
              # An array of keys from the ConfigMap to create as files
              items:
              - key: gencert_head.sh
                path: gencert_head.sh
          - name: ray-logs
            emptyDir: {}
          - name: tmp
            emptyDir: {}
          - name: public-folder
            persistentVolumeClaim:
              claimName: public-folder
          - name: launch-cache
            persistentVolumeClaim:
              claimName: launch-cache
  workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 64
      groupName: gpu-group
      rayStartParams:
        num-cpus: '4'
      template:
        metadata:
          labels:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: default-editor
          nodeSelector:
            instance-type: g5.8xlarge
          tolerations:
            - effect: NoSchedule
              key: as_gpu_g5_8xlarge_ns
              operator: Equal
              value: "true"
            - effect: NoExecute
              key: as_gpu_g5_8xlarge_ne
              operator: Equal
              value: "true"
          initContainers:
          - name: ray-worker-tls
            image: rayproject/ray:2.40.0-py311-gpu
            command: ["/bin/sh", "-c", "cp -R /etc/ca/tls /etc/ray && /etc/gen/tls/gencert_worker.sh"]
            volumeMounts:
              - mountPath: /etc/ca/tls
                name: ca-tls
                readOnly: true
              - mountPath: /etc/ray/tls
                name: ray-tls
              - mountPath: /etc/gen/tls
                name: gen-tls-script
            env:
              - name: POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
          containers:
          - name: ray-worker
            image: rayproject/ray:2.40.0-py311-gpu
            ports:
              - containerPort: 6379 # GCS server
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh","-c","ray stop"]
            env:
            - name: KUBERAY_GEN_RAY_START_CMD
              value: "ray start  --memory=118111600640  --num-gpus=1  --num-cpus=4  --address=duy-raycluster-head-svc.az-cp-launch.svc.cluster.local:6379  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365"
            - name: RAY_USE_TLS
              value: "1"
            - name: RAY_TLS_SERVER_CERT
              value: "/etc/ray/tls/tls.crt"
            - name: RAY_TLS_SERVER_KEY
              value: "/etc/ray/tls/tls.key"
            - name: RAY_TLS_CA_CERT
              value: "/etc/ca/tls/ca.crt"
            - name: RAY_BACKEND_LOG_LEVEL
              value: warning
            - name: RAY_memory_usage_threshold   # Raise Ray's memory usage threshold
              value: "0.99"                      # Set threshold to 99%
       
            volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            - mountPath: /temp
              name: tmp
            resources:
              limits:
                cpu: "27"
                memory: "110Gi"
                nvidia.com/gpu: 1
              requests:
                cpu: "24"
                memory: "100Gi"
                nvidia.com/gpu: 1
            securityContext:
             runAsUser: 1000
             allowPrivilegeEscalation: true
             capabilities:
               drop: ["ALL"]
             runAsNonRoot: true
             seccompProfile:
               type: RuntimeDefault
          volumes:
          - name: ray-logs
            emptyDir: {}
          - name: tmp
            emptyDir: {}
          - name: launch-cache
            persistentVolumeClaim:
              claimName: launch-cache
          - name: ca-tls
            secret:
              secretName: ca-tls
          - name: ray-tls
            emptyDir: {}
                # The gencert_worker.sh can be prebaked into the docker container so the configMap is optional
          - name: gen-tls-script
            configMap:
              name: tls
              defaultMode: 0777
              # An array of keys from the ConfigMap to create as files
              items:
              - key: gencert_worker.sh
                path: gencert_worker.sh
          - name: dshm
            emptyDir:
              medium: Memory  # This specifies the volume will use RAM
              sizeLimit: 64Gi
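
The `ca-tls` Secret at the top of the manifest expects base64-encoded `ca.crt` and `ca.key` values. As a rough sketch of how those could be produced with a self-signed CA: the subject string, key size, and validity period below are placeholders chosen for illustration, not values taken from the thread.

```shell
#!/bin/sh
set -eu
# Work in a scratch directory so no existing files are overwritten.
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# Create a self-signed CA: ca.key is the private key, ca.crt the certificate.
# The subject is a placeholder; adjust it to your organisation.
openssl req -x509 -nodes -newkey rsa:2048 -days 365 \
    -keyout ca.key -out ca.crt -subj "/CN=example-ray-ca/O=ray" 2>/dev/null

# Base64-encode each file onto a single line for the Secret's data fields
# (tr removes the line wrapping that base64 may add).
CA_CRT_B64="$(base64 < ca.crt | tr -d '\n')"
CA_KEY_B64="$(base64 < ca.key | tr -d '\n')"

echo "Files written to $WORKDIR"
echo "ca.crt value length: ${#CA_CRT_B64}"
echo "ca.key value length: ${#CA_KEY_B64}"
```

The two variables hold the strings to paste under `ca.crt:` and `ca.key:` in the Secret; the key is deliberately not printed to the terminal.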

I found a ray-cluster TLS example, and it seems to work fine: kuberay/ray-operator/config/samples/ray-cluster.tls.yaml at master · ray-project/kuberay · GitHub
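
When comparing a failing manifest against that sample, one thing that may be worth checking is whether the files named by the `RAY_TLS_*` variables are actually readable inside the Ray container itself, and not only in the init container that generates them. Whether this explains the `gcs_server` error quoted above is not confirmed in the thread. The sketch below assumes it is run inside the head or worker container (for example through `kubectl exec`); the function name `check_ray_tls_files` is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: report any RAY_TLS_* path that is unset or unreadable.
# Returns 0 when all three files are readable, 1 otherwise.
check_ray_tls_files() {
    status=0
    for var in RAY_TLS_SERVER_CERT RAY_TLS_SERVER_KEY RAY_TLS_CA_CERT; do
        # Look up the value of the variable whose name is stored in $var.
        eval "path=\${$var:-}"
        if [ -z "$path" ]; then
            echo "$var is not set"
            status=1
        elif [ ! -r "$path" ]; then
            echo "$var points to $path, which is missing or unreadable"
            status=1
        else
            echo "$var ok: $path"
        fi
    done
    return "$status"
}

check_ray_tls_files || echo "TLS files incomplete; check the container's volumeMounts"
```

If all three files are present, `openssl verify -CAfile "$RAY_TLS_CA_CERT" "$RAY_TLS_SERVER_CERT"` can additionally confirm that the node certificate was signed by the expected CA.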