How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty in completing my task, but I can work around it.
Our use case is the following: we're trying to set up Ray Tune across our codebase, and we anticipate many concurrent connections to a single Ray cluster (many Tune jobs running at once). Each of these makes a separate ray.init call.
We are using KubeRay to run our Ray cluster on k8s.
To test this, we created 50 background processes, each opening a new connection, with starts staggered by 1 s so that they do not all try to connect simultaneously. A variety of errors appear. Some of those are listed here:
2022-11-30 19:30:36,427 ERROR dataclient.py:323 -- Unrecoverable error in data channel.
raise ConnectionError(msg)
ConnectionError: Request can't be sent because the Ray client has already been disconnected due to an error. Last exception: Failed to reconnect within the reconnection grace period (30s)
and
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/venv/lib/python3.9/site-packages/ray/util/client/server/proxier.py", line 693, in Datapath
    raise RuntimeError(
RuntimeError: Proxy failed to Connect to backend! Check `ray_client_server.err` and `ray_client_server_23030.err` on the head node of the cluster for the relevant logs. By default these are located at /tmp/ray/session_latest/logs.
where the .err file shows:
2022-11-30 19:37:55,877 INFO server.py:884 -- Starting Ray Client server on 0.0.0.0:23030
2022-11-30 19:38:01,332 INFO server.py:931 -- 25 idle checks before shutdown.
2022-11-30 19:38:03,890 INFO logservicer.py:103 -- New logs connection established. Total clients: 1
2022-11-30 19:38:06,398 INFO server.py:931 -- 20 idle checks before shutdown.
2022-11-30 19:38:11,482 INFO server.py:931 -- 15 idle checks before shutdown.
2022-11-30 19:38:16,509 INFO server.py:931 -- 10 idle checks before shutdown.
2022-11-30 19:38:21,535 INFO server.py:931 -- 5 idle checks before shutdown.
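The test setup described earlier can be sketched roughly as follows. This is a minimal illustration, not the exact script used: the head-node address `raycluster-head-svc`, the 60-second sleep standing in for a Tune job, and the helper names `launch_staggered` and `wait_all` are all assumptions. Port 10001 is the default Ray Client server port.

```python
import subprocess
import sys
import time

# Hypothetical head-node address; replace with your own service name.
RAY_ADDRESS = "ray://raycluster-head-svc:10001"

# Script run by each background process: connect, hold the connection, disconnect.
CLIENT_SCRIPT = f"""
import time
import ray

ray.init({RAY_ADDRESS!r})  # one Ray Client connection per process
time.sleep(60)             # stand-in for a Tune job (assumption)
ray.shutdown()
"""


def launch_staggered(script, num_procs=50, delay_s=1.0):
    """Start num_procs Python subprocesses running `script`, delay_s apart."""
    procs = []
    for i in range(num_procs):
        procs.append(subprocess.Popen([sys.executable, "-c", script]))
        if i < num_procs - 1:
            time.sleep(delay_s)  # stagger the connection attempts
    return procs


def wait_all(procs):
    """Wait for every process and return the exit codes in launch order."""
    return [p.wait() for p in procs]


# Against a live cluster:
#   codes = wait_all(launch_staggered(CLIENT_SCRIPT))
#   print(sum(code != 0 for code in codes), "clients exited with an error")
```

Running each client in its own interpreter keeps the connections independent, so a failure in one shows up only as a non-zero exit code for that process.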
My question here is two-fold.
- Is there a way to avoid these types of connection errors?
- More generally, is there a preferred way to establish many separate client-server connections, ideally while minimizing the amount of RAM that each one takes up on the head node?
cc @spolcyn