Access Ray Serve from inside the K8s cluster

I cannot access the deployed Ray Serve service from another Pod within the same K8s cluster.
From the same pod, I can initialize the Ray client to the cluster using ray.init('ray://raycluster-autoscaler-head-svc.default.svc.cluster.local:10001'), but response = requests.post('http://raycluster-autoscaler-head-svc.default.svc.cluster.local:8000/', json=text_input) does not work.
I could only use response = requests.post('http://localhost:8000/', json=text_input) from within the same Python process where I deployed Ray Serve.

Can you check how raycluster-autoscaler-head-svc is configured? It might not expose port 8000.
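
For example, something along these lines (assuming the cluster runs in the default namespace) will show which ports the head service actually exposes:

kubectl get service raycluster-autoscaler-head-svc -n default -o yaml
# or, for a quicker summary of the exposed ports:
kubectl describe service raycluster-autoscaler-head-svc -n default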

If you remove the list of ports from the Ray head container config, the serve port should automatically be exposed in the service.

I configured port 8000 among others.
@Dmitri good tip. I’ll try that next time.
For now, I followed this template; it exposes Ray Serve as a service, and I was able to call Ray Serve at http://rayservice-sample-serve-svc.default.svc.cluster.local:8000/
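
For reference, the call from another pod looked roughly like this (text_input here is just a placeholder for whatever JSON payload the deployment expects):

import requests

# Hypothetical payload; replace with whatever your Serve deployment expects.
text_input = {"text": "hello"}
response = requests.post(
    "http://rayservice-sample-serve-svc.default.svc.cluster.local:8000/",
    json=text_input,
)
print(response.status_code, response.text)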

So the question is: in order to expose Ray Serve externally, do I need to apply this special template, or can I just do it from a vanilla Ray cluster?

You can use a vanilla RayCluster CR (using a RayService CR is not required) – just make sure to either

  • specify port 8000 in the Ray head container’s ports list
    OR
  • don’t specify any ports, in which case default ports will be configured for the service

You did say you configured port 8000, so it is surprising that it’s not working.

Could you share the full configuration for which this didn’t work?

This is what I had

# This config demonstrates KubeRay's Ray autoscaler integration.
# The resource requests and limits in this config are too small for production!
# For an example with more realistic resource configuration, see
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-autoscaler
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  rayVersion: '2.0.0'
  # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
  # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
  # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
  enableInTreeAutoscaling: true
  # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
  # The example configuration shown below represents the DEFAULT values.
  # (You may delete autoscalerOptions if the defaults are suitable.)
  autoscalerOptions:
    # upscalingMode is "Conservative", "Default", or "Aggressive."
    # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
    # Default: Upscaling is not rate-limited.
    # Aggressive: An alias for Default; upscaling is not rate-limited.
    upscalingMode: Default
    # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
    idleTimeoutSeconds: 60
    # image optionally overrides the autoscaler's container image.
    # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
    # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
    ## image: "my-repo/my-custom-autoscaler-image:tag"
    # imagePullPolicy optionally overrides the autoscaler container's image pull policy.
    imagePullPolicy: Always
    # resources specifies optional resource request and limit overrides for the autoscaler container.
    # For large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
    resources:
      limits:
        cpu: "2"
        memory: "10G"
      requests:
        cpu: "2"
        memory: "10G"
  ######################headGroupSpec#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: ClusterIP
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headgroup
    # the following params are used to complete the ray start: ray start --head --block ...
    rayStartParams:
      # Flag "no-monitor" will be automatically set when autoscaling is enabled.
      dashboard-host: '0.0.0.0'
      block: 'true'
      # num-cpus: '1' # can be auto-completed from the limits
      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the specific format demonstrated below:
      # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
    #pod template
    template:
      spec:
        containers:
        # The Ray head pod
        - name: ray-head
          image: rayproject/ray:2.0.0
          imagePullPolicy: Always
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "2"
              memory: "10G"
            requests:
              cpu: "2"
              memory: "10G"
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 1
    minReplicas: 1
    maxReplicas: 300
    # logical group name, for this called small-group, also can be functional
    groupName: small-group
    # if worker pods need to be added, we can simply increment the replicas
    # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
    # the operator will remove pods from the list until the number of replicas is satisfied
    # when a pod is confirmed to be deleted, its name will be removed from the list below
    #scaleStrategy:
    #  workersToDelete:
    #  - raycluster-complete-worker-small-group-bdtwh
    #  - raycluster-complete-worker-small-group-hv457
    #  - raycluster-complete-worker-small-group-k8tj7
    # the following params are used to complete the ray start: ray start --block ...
    rayStartParams:
      block: 'true'
    #pod template
    template:
      metadata:
        labels:
          key: value
        # annotations for pod
        annotations:
          key: value
      spec:
        initContainers:
        # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
        - name: init-myservice
          image: busybox:1.28
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
        containers:
        - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
          image: rayproject/ray:2.0.0
          # environment variables to set in the container. Optional.
          # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "2"
              memory: "10G"
            requests:
              cpu: "2"
              memory: "10G"

Thanks for the config!

Posting my progress here:
I’m able to connect via port-forwarding from my laptop and via localhost:8000 from the head pod, but I am seeing a connection error when using the service’s K8s DNS name… I think we’re close.
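
For reference, the port-forward was along these lines (assuming the default namespace and the head service name from the config above):

kubectl port-forward service/raycluster-autoscaler-head-svc 8000:8000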

This is surprising. I’ll call over some colleagues to look into it. The reproduction is to submit the following script to the cluster.

import requests
from starlette.requests import Request
from typing import Dict
import time

from ray import serve


# 1: Define a Ray Serve deployment.
@serve.deployment(route_prefix="/")
class MyModelDeployment:
    def __init__(self, msg: str):
        # Initialize model state: could be very large neural net weights.
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}


# 2: Deploy the model.
serve.run(MyModelDeployment.bind(msg="Hello world!"))

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/").json())

while True:
    print("Sleeping!")
    time.sleep(10)
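
For reference, one way to submit the script above is the Ray Jobs CLI (a sketch; it assumes the dashboard on port 8265 is reachable, e.g. via port-forwarding, and that the script is saved under the hypothetical name serve_repro.py in the current directory):

export RAY_ADDRESS="http://127.0.0.1:8265"
ray job submit --working-dir . -- python serve_repro.py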

If you run the following in the head pod, you get the expected output.

wget -O - "http://localhost:8000"

If you run the following in the head pod, you get a connection refused error.

wget -O - "http://raycluster-autoscaler-head-svc.default.svc.cluster.local:8000/"

If you port-forward the service’s port 8000 to your local machine, the following gives the correct output.

wget -O - "http://localhost:8000"

We were able to fix Dmitri’s repro above by doing serve.run(MyModelDeployment.bind(msg="Hello world!"), host="0.0.0.0") (adding the host argument, see Serve docs for more detail). Can you check if this works for you?
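
Concretely, the change amounts to the following (a minimal, self-contained sketch of the deployment part of the script above):

from typing import Dict

from starlette.requests import Request

from ray import serve


@serve.deployment(route_prefix="/")
class MyModelDeployment:
    def __init__(self, msg: str):
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}


# Binding the HTTP proxy to 0.0.0.0 (instead of the default 127.0.0.1) is what
# makes the deployment reachable through the Kubernetes service, not just via
# localhost on the head pod.
serve.run(MyModelDeployment.bind(msg="Hello world!"), host="0.0.0.0")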