Unable to connect to Ray Cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have deployed a Ray Cluster based on this Helm Chart, with some slight modifications to the Ingress to forward the Dashboard, Client, and Serve ports, like so:

Assume the Host URL is https://ray.com

  1. The Ray Dashboard is accessible from https://ray.com, as I have set / to forward to port 8265.
  2. The Ray Client is accessible from https://ray.com/client, as I have set /client to forward to port 10001.
  3. Ray Serve is accessible from https://ray.com/serve, as I have set /serve to forward to port 8000.

The dashboard works, while the serve and client endpoints return a 404 Not Found error; I am assuming that is expected when you make a curl request to them.
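
For reference, the checks described above amount to roughly the following (a sketch using Python requests in place of curl; https://ray.com is the placeholder host from above):

import requests

# Probe each ingress path; the dashboard should answer with HTTP 200,
# while the client and serve paths returned 404 in my case.
for path in ("/", "/client", "/serve"):
    resp = requests.get(f"https://ray.com{path}", timeout=10)
    print(path, resp.status_code)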

NOTE: I am accessing the Ray Cluster which is deployed within a VPC through a VPN.

Job Submission Methods Attempted

  1. Submitting a job to the cluster works when I use the REST API method (a sketch of the call is shown after this list):
  # Output of the jobs I deployed
  {
      'raysubmit_jxQVCQ3LRGHtuw8C': {
          'status': 'SUCCEEDED',
          'entrypoint': 'echo hello',
          'message': 'Job finished successfully.',
          'error_type': None,
          'start_time': 1661220561123,
          'end_time': 1661220561801,
          'metadata': {'job_submission_id': '1234'},
          'runtime_env': {}
      },
      'raysubmit_TXw3cZHZuy5s9FaD': {
          'status': 'SUCCEEDED',
          'entrypoint': 'echo hello',
          'message': 'Job finished successfully.',
          'error_type': None,
          'start_time': 1661153417462,
          'end_time': 1661153418330,
          'metadata': {'job_submission_id': '123'},
          'runtime_env': {}
      }
  }
  2. Submitting a job via the Python SDK does not work. I have tried the following:
  • client = JobSubmissionClient("http://ray.com") # connection timeout
  • client = JobSubmissionClient("http://ray.com/client") # "Jobs API is not supported on the Ray cluster. Please ensure the cluster is running Ray 1.9 or higher." (Not sure how to upgrade the Ray Cluster, as I am already using the latest image.)
  3. Submitting a job via the CLI:
    I tried to first list all the jobs, but even that does not work. Why does Ray automatically append the default dashboard port (8265) to the address that is passed in? Is there any way to disable this?
> ray job list --address "http://ray.com"     
ConnectionError: Failed to connect to Ray at the address: http://ray.com:8265.
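
For completeness, the REST method that worked looks roughly like this (a sketch following the Ray 2.0 Jobs REST API, which is served by the dashboard; the metadata field mirrors the job_submission_id values visible in the output above):

import requests

BASE = "https://ray.com"  # the dashboard (port 8265) is routed at /

# Submit a job over the Jobs REST API.
submit = requests.post(
    f"{BASE}/api/jobs/",
    json={"entrypoint": "echo hello", "metadata": {"job_submission_id": "1234"}},
)
submit.raise_for_status()
print(submit.json())  # contains the generated raysubmit_... id

# List all jobs; this is what produced the output shown in point 1.
print(requests.get(f"{BASE}/api/jobs/").json())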

Ray Serve Deployment Attempt

I also tried to deploy a simple Ray Serve Deployment based on the Ray Serve Deployment example.

import ray
from ray import serve

# Connect to the running Ray cluster.
ray.init(
    address="ray://ray.com",
    namespace="test-serve",
)
# Bind on 0.0.0.0 to expose the HTTP server on external IPs.
serve.start(detached=True, http_options={"host": "0.0.0.0"})


@serve.deployment(route_prefix="/hello")
def hello(request):
    return "hello world"

hello.deploy()

I tried all the following addresses as well:

  • ray://ray.com/client
  • ray://ray.com/serve

But all of them throw the same error: ConnectionError: ray client connection timeout.

Ingress

I have attached the Ingress file below for reference just in case there is some error on my end with the deployment:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "staging-ray-cluster-ingress"
  namespace: {{ .Values.operatorNamespace }}
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/security-groups: {{ .Values.sgElb }}
    alb.ingress.kubernetes.io/group.name: "ray"
    alb.ingress.kubernetes.io/listen-ports:  '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: {{ .Values.certificate }}
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/healthcheck-path: "/"
    alb.ingress.kubernetes.io/success-codes: "200"
  labels:
    app: "staging-ray-cluster"
spec:
  rules:
  - host: {{ .Values.externalDNS.domainName }} # ray.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ray-head
            port:
              number: 8265
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ssl-redirect
            port:
              name: use-annotation
      - path: /client
        pathType: Prefix
        backend:
          service:
            name: ray-head
            port:
              number: 10001
      - path: /serve
        pathType: Prefix
        backend:
          service:
            name: ray-head
            port:
              number: 8000

Sorry for the late reply. @architkulkarni @ckw017, could you help him?

The version incompatibility message is surprising. Are you using the image rayproject/ray:2.0.0 (or an image built using that image as the base)?
Are you also using Ray 2.0.0 on the machine that is submitting the job?

Unfortunately, the version error message might be a red herring. Is there a full traceback?

I think the simplest common thing we need to get working is the single line ray.init(address="ray://ray.com"); once that works, all of the above examples should work.
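
For concreteness, that minimal check might look like this (a sketch; ray.com stands in for the real host, and the Ray Client port defaults to 10001 when none is given):

import ray

# The ray:// scheme selects the Ray Client (gRPC) protocol.
ray.init(address="ray://ray.com")
print(ray.cluster_resources())  # any round-trip call confirms the connection
ray.shutdown()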

Is there any further traceback for the ConnectionError when you run this line?

@ckw017 do you have experience with setting up Ingress files and port forwarding? Does that part look okay to you?

I’m tracking the documentation gap here.

In the Helm Chart, I am using the image rayproject/ray:latest.

Here is the exact traceback:

---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
<ipython-input-5-23f69200547b> in <module>
----> 1 ray.init(address="ray://staging-ray.moneylion.io", namespace="test")

~/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    103             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105         return func(*args, **kwargs)
    106 
    107     return wrapper

~/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/worker.py in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _metrics_export_port, _system_config, _tracing_startup_hook, _node_name, **kwargs)
    885         passed_kwargs.update(kwargs)
    886         builder._init_args(**passed_kwargs)
--> 887         return builder.connect()
    888 
    889     if kwargs:

~/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/client_builder.py in connect(self)
    162             job_config=self._job_config,
    163             _credentials=self._credentials,
--> 164             ray_init_kwargs=self._remote_init_kwargs,
    165         )
    166         get_dashboard_url = ray.remote(ray.worker.get_dashboard_url)

~/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client_connect.py in connect(conn_str, secure, metadata, connection_retries, job_config, namespace, ignore_version, _credentials, ray_init_kwargs)
     43         ignore_version=ignore_version,
     44         _credentials=_credentials,
---> 45         ray_init_kwargs=ray_init_kwargs,
     46     )
     47     return conn

~/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/__init__.py in connect(self, *args, **kw_args)
    241     def connect(self, *args, **kw_args):
    242         self.get_context()._inside_client_test = self._inside_client_test
--> 243         conn = self.get_context().connect(*args, **kw_args)
    244         global _lock, _all_contexts
    245         with _lock:

~/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/__init__.py in connect(self, conn_str, job_config, secure, metadata, connection_retries, namespace, ignore_version, _credentials, ray_init_kwargs)
     89                 _credentials=_credentials,
     90                 metadata=metadata,
---> 91                 connection_retries=connection_retries,
     92             )
     93             self.api.worker = self.client_worker

~/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/worker.py in __init__(self, conn_str, secure, metadata, connection_retries, _credentials)
    141         self._has_connected = False
    142 
--> 143         self._connect_channel()
    144         self._has_connected = True
    145 

~/opt/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/worker.py in _connect_channel(self, reconnecting)
    262                     "more information."
    263                 )
--> 264             raise ConnectionError("ray client connection timeout")
    265 
    266     def _can_reconnect(self, e: grpc.RpcError) -> bool:

ConnectionError: ray client connection timeout

An update from my end: after consulting my DevOps team, they recommended that I try pointing each port to a different host through the Ingress, like so. However, the client and serve endpoints are still throwing a 502 Bad Gateway error:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "staging-ray-cluster-ingress"
  namespace: {{ .Values.operatorNamespace }}
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/security-groups: {{ .Values.sgElb }}
    alb.ingress.kubernetes.io/group.name: "ray"
    alb.ingress.kubernetes.io/listen-ports:  '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: {{ .Values.certMoneylion }}
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'

    # allow 404s on the health check test 1
    alb.ingress.kubernetes.io/healthcheck-path: "/"
    alb.ingress.kubernetes.io/success-codes: "200"
  labels:
    app: "staging-ray-cluster"
spec:
  rules:
  - host: {{ .Values.externalDNS.domainNameClient }}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: staging-ray-ray-head
            port:
              number: 10001
  - host: {{ .Values.externalDNS.domainNameServe }}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: staging-ray-ray-head
            port:
              number: 8000
  - host: {{ .Values.externalDNS.domainNameDashboard }}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: staging-ray-ray-head
            port:
              number: 8265

Are you able to port-forward from your cluster to your local machine? It would help rule out whether the issue is with the ingress/service. Basically, you would port-forward staging-ray-ray-head:10001 to your local machine, and try ray.init("ray://localhost:10001") to see if the connection works.
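
For reference, that isolation test might look like the following (a sketch; it assumes kubectl access to the cluster and the head service name from the Helm chart):

# In a shell, forward the Ray Client port from the head node's service:
#   kubectl -n <namespace> port-forward service/staging-ray-ray-head 10001:10001
# Then, from the same machine, bypass the ingress entirely:
import ray

ray.init(address="ray://localhost:10001")
print(ray.cluster_resources())
ray.shutdown()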

Hey, yes it does. I would like to confirm something: I have read that the client and serve endpoints are gRPC instead of HTTP, which could be the issue here. Could you confirm that? If so, I will need to add the following for the AWS ALB Ingress to work:

  1. alb.ingress.kubernetes.io/healthcheck-path: /package.service/method
  2. alb.ingress.kubernetes.io/backend-protocol-version: GRPC

Did switching to the GRPC annotation work for you?

Nope. I tried the following, where I converted it to multiple Ingresses, one for each service, and added the GRPC annotation for the backend-protocol version, referencing this thread, but no luck. It seems I need the Ray team to tell me the path in the format /package.service/method for the health check and the backend for it to work. cc @ckw017 @architkulkarni

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "staging-ray-cluster-ingress-dashboard"
  namespace: {{ .Values.operatorNamespace }}
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/security-groups: {{ .Values.sgElb }}
    alb.ingress.kubernetes.io/group.name: ray
    alb.ingress.kubernetes.io/listen-ports:  '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: {{ .Values.cert }}
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/healthcheck-path: "/"
    alb.ingress.kubernetes.io/success-codes: "200"
  labels:
    app: "staging-ray-cluster"
spec:
  rules:
  - host: {{ .Values.externalDNS.domainNameDashboard }}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: staging-ray-ray-head
            port:
              number: 8265
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "staging-ray-cluster-ingress-client"
  namespace: {{ .Values.operatorNamespace }}
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/security-groups: {{ .Values.sgElb }}
    alb.ingress.kubernetes.io/group.name: ray
    alb.ingress.kubernetes.io/listen-ports:  '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: {{ .Values.cert }}
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    # TODO:
    # 1. Figure out healthcheck path
    # 2. Figure out backend path 
  labels:
    app: "staging-ray-cluster"
spec:
  rules:
  - host: {{ .Values.externalDNS.domainNameClient }}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: staging-ray-ray-head
            port:
              number: 10001
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "staging-ray-cluster-ingress-serve"
  namespace: {{ .Values.operatorNamespace }}
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/security-groups: {{ .Values.sgElb }}
    alb.ingress.kubernetes.io/group.name: ray
    alb.ingress.kubernetes.io/listen-ports:  '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: {{ .Values.cert }}
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/healthcheck-path: "/"
    alb.ingress.kubernetes.io/success-codes: "200"
  labels:
    app: "staging-ray-cluster"
spec:
  rules:
  - host: {{ .Values.externalDNS.domainNameServe }}
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: staging-ray-ray-head
            port:
              number: 8000
---

Ah, this looks like it could work though! @Dmitri

Digging around the Ray Client protobuf definition, it looks like there isn't a good existing method to use for health checking. The closest would be ClusterInfo with the type set to PING, but it doesn't look like there's a proper way to configure an argument with ALB.

This blog post mentions a workaround for this: using a success code of 12 (gRPC "method not found"). Can you check if this workaround works for you?
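
To illustrate why a success code of 12 can work as a health check (a sketch, assuming grpcio is installed and using the placeholder host from above): a gRPC call to a method the server does not implement is answered with gRPC status 12 (UNIMPLEMENTED), which still proves the server behind the target is alive.

import grpc

channel = grpc.insecure_channel("ray.com:443")  # or grpc.secure_channel(...) with TLS
# Deliberately call a method that does not exist on the Ray Client server.
probe = channel.unary_unary("/grpc.health.v1.DoesNotExist/Check")
try:
    probe(b"", timeout=5)
except grpc.RpcError as e:
    # UNIMPLEMENTED is gRPC status code 12: the server responded, so it is up.
    print(e.code(), e.code().value[0])  # StatusCode.UNIMPLEMENTED 12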

Using the above ALB configurations with the gRPC health check success code of 12, the cluster can be found using ray.init("ray.com:443").

Which returned this:

2022-09-07 15:39:05,870 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: ray.com:443

Unfortunately, this warning was then returned in an infinite loop:

2022-09-07 14:53:43,842 WARNING utils.py:1219 -- Unable to connect to GCS at ray.com:443. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

Connecting does work from within the cluster with the following:

kubectl exec raycluster-autoscaler-head-skjs9 -it -c ray-head -- python -c "import ray; ray.init()"
2022-09-07 08:47:14,092 INFO worker.py:1224 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2022-09-07 08:47:14,092 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 10.196.11.72:6379...
2022-09-07 08:47:14,097 INFO worker.py:1515 -- Connected to Ray cluster. View the dashboard at http://10.196.11.72:8265

Should the ingress be serving the client port or the GCS port? Or do the network rules of the ingress need access to the GCS port?

The ingress should be serving the client port; connecting to the GCS port directly (i.e., with a plain ray.init()) will only work if the script you're running is colocated on the head node or a worker node.

Your call should look something like ray.init(address="ray://ray.com:10001") (or have 443 route to 10001 on the head node) if you want to connect to a remote cluster. GCS doesn’t need to be exposed for that to work.
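
In other words (a sketch; the key difference from the earlier ray.init("ray.com:443") attempt is the ray:// scheme, which selects the Ray Client protocol instead of a direct GCS connection):

import ray

# Without a scheme, the address is treated as a GCS address, so the driver
# tries to reach GCS directly; that only works from inside the cluster.
# ray.init(address="ray.com:443")

# With ray://, the driver speaks the Ray Client gRPC protocol to the client
# port (10001 by default, or whatever 443 routes to on the head node).
ray.init(address="ray://ray.com:10001")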

I added those, but the health check of the client endpoint's target group is still unhealthy. The dashboard and serve endpoints are fine, though. I can also reproduce what @wfclark5 did.

I see. Is there any way to get details on why the target group health check is failing?

I'm not sure why the target group health check is failing. As mentioned before, the PING-type health check looks for ray.rpc.RayletDriver/ClusterInfo.

After removing SSL/TLS from the Ingress and ALB configurations, this was the latest response returned:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/ray/_private/utils.py", line 1327, in internal_kv_get_with_retry
    result = gcs_client.internal_kv_get(key, namespace)
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/ray/_private/gcs_utils.py", line 177, in wrapper
    return f(self, *args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/ray/_private/gcs_utils.py", line 269, in internal_kv_get
    reply = self._kv_stub.InternalKVGet(req, timeout=timeout)
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNIMPLEMENTED
        details = "Method not found!"
        debug_error_string = "{"created":"@1662717804.510619240","description":"Error received from peer ipv4:10.196.11.70:80","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Method not found!","grpc_status":12}"
>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/ray/_private/worker.py", line 1475, in init
    _global_node = ray._private.node.Node(
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/ray/_private/node.py", line 182, in __init__
    session_name = ray._private.utils.internal_kv_get_with_retry(
  File "/home/ubuntu/miniconda3/envs/ray/lib/python3.8/site-packages/ray/_private/utils.py", line 1349, in internal_kv_get_with_retry
    raise ConnectionError(