The Ray agent couldn't be started due to the port conflict. To solve the problem, start Ray with a hard-coded agent port

We are running some ray.tune unit tests on Linux. The test setup is like this. We only launch one ray cluster.


ray.init()

tune.run(xxxx)

ray.shutdown()

We occasionally see this error:


E ray.exceptions.RuntimeEnvSetupError: Failed to setup runtime environment.

E Could not create the actor because its associated runtime env failed to be created.

E Failed to create runtime environment {"envVars": {"TUNE_ORIG_WORKING_DIR": "xxxx"}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.

However, I think Ray assigns this port randomly internally. I saw a similar issue in TEST: CI: flaky ray windows failure: "System error: Unknown error" · Issue #4905 · modin-project/modin · GitHub. Also, ray.init() does not have an argument called dashboard-agent-grpc-port…
How can we properly work around this flakiness? Should we run ray start --dashboard-agent-grpc-port 0 in a subprocess? Any suggestions?

cc @architkulkarni I think we should retry if the randomly generated port number turns out to be in use. Is that work planned for runtime env creation, and is my diagnosis correct?
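
To make the retry idea concrete, here is a minimal sketch of retry-on-conflict port selection. It is not Ray's actual implementation; the function name `bind_with_retry`, the port range, and the attempt count are all assumptions chosen for illustration.

```python
import random
import socket


def bind_with_retry(max_attempts=5, low=20000, high=60000):
    """Pick random ports until one binds; return (port, bound_socket).

    Hypothetical helper: the name, range, and attempt count are illustrative.
    """
    last_error = None
    for _ in range(max_attempts):
        port = random.randint(low, high)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            return port, sock  # caller owns the socket and must close it
        except OSError as exc:
            # Port is in use (or otherwise unavailable): release and retry.
            sock.close()
            last_error = exc
    raise RuntimeError(f"no free port after {max_attempts} attempts") from last_error
```

Returning the bound socket rather than just the port number avoids a window in which another process could take the port between the check and its use.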

Sorry you’re running into this. I think the best approach is to specify the port in ray start, as you suggest. We plan to fix this in the future.

Doing

+        sock = socket.socket()
+        sock.bind(('', 0))
+        free_port = sock.getsockname()[1]
+        subprocess.check_call(["ray", "start" , "--head", f"--dashboard-agent-grpc-port={free_port}", "--include-dashboard=False"])
+        ray.init(address="auto")

Still doesn’t solve the issue. @architkulkarni, do you have any suggestions?

Can you share the log from dashboard_agent.log?

Synced offline. @raytune_kuberay_user will try setting both agent ports specified in this doc Configuring Ray — Ray 3.0.0.dev0
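
A minimal sketch of what setting both agent ports could look like, assuming the two ports in question are the ones controlled by the `ray start` flags `--dashboard-agent-grpc-port` and `--dashboard-agent-listen-port`. The helper names `find_free_ports` and `build_ray_start_cmd` are hypothetical. Unlike the snippet earlier in the thread, the probe sockets are closed before Ray is launched; leaving the socket open may keep the chosen port occupied, though that is a guess rather than a confirmed cause of the failure.

```python
import socket


def find_free_ports(count):
    """Ask the OS for `count` distinct free TCP ports, then release them.

    All probe sockets stay open until every port is chosen, so the
    returned ports are guaranteed to differ from one another.
    """
    socks = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
            socks.append(s)
        return [s.getsockname()[1] for s in socks]
    finally:
        # Close the sockets so the Ray agent can bind the ports itself.
        for s in socks:
            s.close()


def build_ray_start_cmd(grpc_port, listen_port):
    """Build a `ray start` command line that pins both agent ports."""
    return [
        "ray", "start", "--head",
        f"--dashboard-agent-grpc-port={grpc_port}",
        f"--dashboard-agent-listen-port={listen_port}",
        "--include-dashboard=False",
    ]


# Intended usage in a test setup (not executed here):
#   import subprocess, ray
#   subprocess.check_call(build_ray_start_cmd(*find_free_ports(2)))
#   ray.init(address="auto")
```

Another process could still grab a port in the short gap between closing the probe socket and Ray binding it, so this narrows the chance of a conflict rather than eliminating it.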

Thanks. For local testing, do you recommend that we use ray.init(local_mode=True)? Will this help resolve the port conflict? @sangcho

We deprecated that feature as of 2.2 (it had been unmaintained for a while…). We recommend that you call ray.init() and run your workloads. If you think local_mode=True is important for unit tests, please file a feature request to re-enable it!

I’m running into the same issue (Failed to create runtime environment for job 01000000 because the Ray agent couldn't be started due to the port conflict. See dashboard_agent.log for more details.). Is there a planned way to do this from ray.init, or do I still need to call out to a separate subprocess even for unit testing? Is there now a better way to do this? Furthermore, I have set include_dashboard=False in my ray.init call but am still getting this error. Is this expected?