gRPC port bind issue after multiple successful jobs

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

After multiple successful Ray jobs (following a ray cluster start), I hit a random problem where the next Ray job can't start the grpc_server because of a port-already-bound issue. Nothing is making use of these ports except Ray. If I restart the cluster, I can run another sequence of jobs that work fine until, after some random number of runs, I get an error like the one below again. I can sometimes run hundreds of jobs, each making use of 10,000+ worker tasks, before hitting this error, and sometimes it happens after only a handful, so it is very hard to reproduce on demand. Any thoughts? This is under Red Hat 9.2, Python 3.11.4, and Ray 2.6.1 on a cluster of 20 nodes (20 nodes x 96 cores).

I am getting an error like:

(raylet, ip=192.168.1.100) E0828 03:39:18.870348198 4034351 chttp2_server.cc:1051] {"created":"@1693193958.870304813","description":"No address added out of total 1 resolved for '0.0.0.0:18321'","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":947,"referenced_errors":[{"created":"@1693193958.870301759","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":357,"referenced_errors":[{"created":"@1693193958.870295770","description":"Unable to configure socket","fd":12,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1693193958.870293202","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":191,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1693193958.870301493","description":"Unable to configure socket","fd":12,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":217,"referenced_errors":[{"created":"@1693193958.870300245","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":191,"os_error":"Address already in use","syscall":"bind"}]}]}]}

(raylet, ip=192.168.1.100) [2023-08-28 03:39:18,893 C 4034351 4034351] grpc_server.cc:119: Check failed: server_ Failed to start the grpc server. The specified port is 18321. This means that Ray's core components will not be able to function correctly. If the server startup error message is Address already in use, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running sudo lsof -i :18321 to check if there are other processes listening to the port.

cc: @Ruiyang_Wang @sangcho

It looks like we have some occasional port assignment conflicts. We will fix this after a refactor of how we initialize the system. Until then, maybe you can retry the job submission to see if it works?
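As a stopgap, the retry can live in whatever launches your jobs. Here is a minimal sketch only; run_job is a placeholder for however you currently submit a job (driver script, job submission API, etc.), and the attempt count and backoff are arbitrary:

import time

def run_with_retries(run_job, max_attempts=3, backoff_s=10):
    # run_job: placeholder callable that launches one Ray job and raises on failure
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except Exception as exc:  # ideally narrow this to the bind-failure error you actually see
            print(f"attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s)  # give any stale socket a chance to be released

The idea is simply that, if this is a transient port collision at startup, a later attempt will usually land on different ports.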

Another foolproof way is to manually assign the ports; see the ports section of Configuring Ray — Ray 2.6.1.

We are already assigning all the ports that can be hard-coded, I believe (please correct me if not); it is the ports in the worker range (11000-65535) that seem to be getting in a bind (pun intended).

If something looks wrong in the following, please let me know.

head node started with:

ray start --head --include-dashboard=false --disable-usage-stats --ray-client-server-port=1099 --redis-shard-ports=1100,1101,1102 --object-manager-port=1103 --node-manager-port=1104 --gcs-server-port=1105 --dashboard-agent-grpc-port=1106 --dashboard-agent-listen-port=1107 --metrics-export-port=1108 --dashboard-port=1109 --min-worker-port=11000 --max-worker-port=65535

worker nodes started with:

ray start --address=192.168.1.100:6379 --object-manager-port=1103 --node-manager-port=1104 --dashboard-agent-grpc-port=1106 --dashboard-agent-listen-port=1107 --metrics-export-port=1108 --min-worker-port=11000 --max-worker-port=65535

When you see the job failure, is it possible to run lsof -i :<port_number> on the affected node to see which process is using the port?
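If lsof is not installed on the nodes, a small Python check can serve the same purpose. This is just a sketch: it assumes the psutil package is available, uses the port number from the error above, and may need to run as root to resolve PIDs owned by other users.

import psutil

PORT = 18321  # the port reported in the raylet error

for conn in psutil.net_connections(kind="inet"):
    if conn.laddr and conn.laddr.port == PORT:
        try:
            name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
        except psutil.NoSuchProcess:
            name = "exited"
        print(f"status={conn.status} pid={conn.pid} process={name}")

Knowing whether the holder is another Ray worker/raylet, a socket stuck in TIME_WAIT, or an unrelated process would narrow this down a lot.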