Error when creating many simultaneous client-server connections

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

Our use case is the following: we’re trying to set up Ray Tune across our codebase, and we anticipate many concurrent connections to a Ray cluster (many Tune jobs running at once). Each of these makes a separate ray.init call.
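For reference, each Tune job connects via Ray Client roughly like this (a minimal sketch; `head-node` is a placeholder, the helper names are ours, and 10001 is Ray's default client-server port):

```python
def client_address(host: str, port: int = 10001) -> str:
    """Build a Ray Client URI; 10001 is the default client-server port."""
    return f"ray://{host}:{port}"

def connect(host: str, port: int = 10001):
    # Deferred import so the sketch can be read without Ray installed.
    import ray
    return ray.init(client_address(host, port))
```

Each background process calls `connect(...)` once and then runs its Tune job over that client connection.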

We are using KubeRay to run our Ray cluster on Kubernetes.

When testing this by creating 50 background processes and having each one create a new connection (staggered by 1 s so that they are not all trying to connect simultaneously), a variety of errors appear. Some of them are listed here:

2022-11-30 19:30:36,427 ERROR -- Unrecoverable error in data channel.
    raise ConnectionError(msg)
ConnectionError: Request can't be sent because the Ray client has already been disconnected due to an error. Last exception: Failed to reconnect within the reconnection grace period (30s)


ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/venv/lib/python3.9/site-packages/ray/util/client/server/", line 693, in Datapath
    raise RuntimeError(
RuntimeError: Proxy failed to Connect to backend! Check `ray_client_server.err` and `ray_client_server_23030.err` on the head node of the cluster for the relevant logs. By default these are located at /tmp/ray/session_latest/logs.

where the .err file shows:

2022-11-30 19:37:55,877 INFO -- Starting Ray Client server on
2022-11-30 19:38:01,332 INFO -- 25 idle checks before shutdown.
2022-11-30 19:38:03,890 INFO -- New logs connection established. Total clients: 1
2022-11-30 19:38:06,398 INFO -- 20 idle checks before shutdown.
2022-11-30 19:38:11,482 INFO -- 15 idle checks before shutdown.
2022-11-30 19:38:16,509 INFO -- 10 idle checks before shutdown.
2022-11-30 19:38:21,535 INFO -- 5 idle checks before shutdown.

My question here is twofold.

  1. Is there a way to avoid these types of connection errors?
  2. More generally, is there a preferred way to establish many different client-server connections and ideally minimize the amount of RAM that each one takes up on the head node?

cc @spolcyn


Hi @walid.a, sorry you’re running into these issues. Which Ray version are you using? Is it possible to use the Ray Jobs API for your use case? We’re moving towards recommending Ray Jobs as the best practice for submitting jobs to a long-running cluster. It might be more robust due to not having to maintain many simultaneous connections.

I’ll also tag @ckw017 in case he has any tips on debugging the specific errors you saw.
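For reference, submitting each Tune run as a job would look roughly like this (a sketch; the dashboard address and entrypoint are placeholders, `submit_tune_job` is a hypothetical helper, and the Jobs server listens on the dashboard port, 8265 by default):

```python
def submit_tune_job(dashboard_url: str, entrypoint: str) -> str:
    """Submit one Tune run as a Ray Job; returns the job ID."""
    # Deferred import so the sketch can be read without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)  # e.g. "http://head-node:8265"
    return client.submit_job(
        entrypoint=entrypoint,               # e.g. "python tune_script.py"
        runtime_env={"working_dir": "."},    # ship the local code to the cluster
    )
```

Each submission is a short-lived HTTP call rather than a long-lived client connection, so the head node doesn't have to keep 50 client-server processes alive at once.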

Hi @architkulkarni, thank you for the response! We are using Ray version 2.1.0. We were actually considering moving towards Ray Jobs for this, so it’s very good to know that that’s the recommended direction.
