gRPC port bind issue after multiple successful jobs

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

After multiple successful Ray jobs (following a ray cluster start), I hit a random problem where the next Ray job can't start the grpc_server because of a port-already-bound issue. Nothing is making use of these ports except Ray. If I restart the cluster, I can run another sequence of jobs that work fine until, after some random number of runs, I get an error like the one below again. I can sometimes run hundreds of jobs, each making use of 10,000+ worker tasks, before hitting this error, and sometimes it happens after only a handful, so it is very hard to reproduce on demand. Any thoughts? This is under Red Hat 9.2, Python 3.11.4, and Ray 2.6.1 on a cluster of 20 nodes (20 nodes x 96 cores).

I am getting an error like:

(raylet, ip=192.168.1.100) E0828 03:39:18.870348198 4034351 chttp2_server.cc:1051] {"created":"@1693193958.870304813","description":"No address added out of total 1 resolved for '0.0.0.0:18321'","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":947,"referenced_errors":[{"created":"@1693193958.870301759","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":357,"referenced_errors":[{"created":"@1693193958.870295770","description":"Unable to configure socket","fd":12,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1693193958.870293202","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":191,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1693193958.870301493","description":"Unable to configure socket","fd":12,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1693193958.870300245","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":191,"os_error":"Address already in use","syscall":"bind"}]}]}]}

(raylet, ip=192.168.1.100) [2023-08-28 03:39:18,893 C 4034351 4034351] grpc_server.cc:119: Check failed: server_ Failed to start the grpc server. The specified port is 18321. This means that Ray's core components will not be able to function correctly. If the server startup error message is Address already in use, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running sudo lsof -i :18321 to check if there are other processes listening to the port.

cc: @Ruiyang_Wang @sangcho

It looks like we have some occasional port assignment conflicts. We will fix this after a refactor of how we initialize the system. Until then, maybe you can retry the job submission to see if it works?
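As a stopgap, the retry can live in whatever launches your jobs. Here is a minimal sketch only; run_job is a placeholder for however you currently submit a job (driver script, job submission API, etc.), and the attempt count and backoff are arbitrary:

import time

def run_with_retries(run_job, max_attempts=3, backoff_s=10):
    # run_job: placeholder callable that launches one Ray job and raises on failure
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except Exception as exc:  # ideally narrow this to the bind-failure error you actually see
            print(f"attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s)  # give any stale socket a chance to be released

The idea is simply that, if this is a transient port collision at startup, a later attempt will usually land on different ports.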

Another foolproof way is to manually assign the ports; see the ports section of Configuring Ray — Ray 2.6.1.

We are already assigning all the ports that can be hard-coded, I believe (please correct me if not); it is the ports in the worker range (11000-65535) that seem to be getting in a bind (pun intended).

If something looks wrong in the following, please let me know.

head node started with:

ray start --head --include-dashboard=false --disable-usage-stats --ray-client-server-port=1099 --redis-shard-ports=1100,1101,1102 --object-manager-port=1103 --node-manager-port=1104 --gcs-server-port=1105 --dashboard-agent-grpc-port=1106 --dashboard-agent-listen-port=1107 --metrics-export-port=1108 --dashboard-port=1109 --min-worker-port=11000 --max-worker-port=65535

worker nodes started with:

ray start --address=192.168.1.100:6379 --object-manager-port=1103 --node-manager-port=1104 --dashboard-agent-grpc-port=1106 --dashboard-agent-listen-port=1107 --metrics-export-port=1108 --min-worker-port=11000 --max-worker-port=65535

When you see the job failure, is it possible to run lsof -i :<port_number> on the affected node to see which process is using the port?
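If lsof is not installed on the nodes, a small Python check can serve the same purpose. This is just a sketch: it assumes the psutil package is available, uses the port number from the error above, and may need to run as root to resolve PIDs owned by other users.

import psutil

PORT = 18321  # the port reported in the raylet error

for conn in psutil.net_connections(kind="inet"):
    if conn.laddr and conn.laddr.port == PORT:
        try:
            name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
        except psutil.NoSuchProcess:
            name = "exited"
        print(f"status={conn.status} pid={conn.pid} process={name}")

Knowing whether the holder is another Ray worker/raylet, a socket stuck in TIME_WAIT, or an unrelated process would narrow this down a lot.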