ERROR gcs_utils.py:137 -- Failed to send request to gcs

I’m facing an issue setting up a cluster manually using ray start. The head node starts successfully and I see the expected “Ray runtime started” message with the connection instructions.

Then, on a different machine that I want to use as a worker, I first check that the head’s port is reachable:

$ nmap -sV --reason -p $PORT $HOST

Starting Nmap 6.40 ( http://nmap.org ) at 2022-02-05 16:37 UTC
Nmap scan report for 100.96.24.172
Host is up, received conn-refused (0.0011s latency).
PORT     STATE SERVICE REASON  VERSION
6379/tcp open  redis   syn-ack Redis key-value store

Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 16.17 seconds

$ ray start --address='100.96.24.172:6379' --redis-password='5241590000000000'
Local node IP: 100.96.191.45
2022-02-05 16:39:15,662 ERROR gcs_utils.py:137 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1644079155.661725492","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1644079155.661724203","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
2022-02-05 16:39:16,665 ERROR gcs_utils.py:137 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1644079156.664823970","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1644079156.664822821","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

What are those other addresses, and how can I find out why it is unable to connect? This looks similar to an unanswered question below.

As a sanity test, I started python -m http.server 8081 on the head node and curled a file from the worker node, and that worked fine.
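For completeness, the sanity test from the worker side amounts to roughly this (the head IP and port are just the ones from my setup above; any file served by http.server, or the directory listing, works):

    # Rough equivalent of the curl sanity test, run from the worker node.
    # Assumes `python -m http.server 8081` is still running on the head node.
    from urllib.request import urlopen

    with urlopen("http://100.96.24.172:8081/", timeout=5) as resp:
        print(resp.status)  # 200 means plain HTTP traffic reaches the head node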

@mwtian is this something you can take a look at?

Nitin, it’s possible that there are a number of ports that need to be opened for Ray to work. See Configuring Ray — Ray v1.10.0.

Thanks for the resource, Richard. To be completely sure, I explicitly specified all the mentioned ports when starting the head process:

ray start --head --port=8080 -v --node-manager-port=8081 --object-manager-port=8082 --ray-client-server-port=8083 --gcs-server-port=8084 --redis-shard-ports=8085,8086,8087 --include-dashboard=false --min-worker-port=8090 --max-worker-port=8100

which successfully started the head node with this output:

Local node IP: 100.96.132.173
2022-02-06 06:07:23,371	WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only ...

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='100.96.132.173:8080' --redis-password='5241590000000000'
  
  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')
  
  To connect to this Ray runtime from outside of the cluster, for example to
  connect to a remote cluster from your laptop directly, use the following
  Python code:
    import ray
    ray.init(address='ray://<head_node_ip_address>:8083')
  
  If connection fails, check your firewall settings and network configuration.
  
  To terminate the Ray runtime, run
    ray stop

On the worker node, I get the same error from ray start --address='100.96.132.173:8080' --redis-password='5241590000000000'. However, it looks like the GCS process didn’t even start on the head node, which is why the worker is failing to connect.

$ nmap -sV --reason -p 8084 100.96.132.173

Starting Nmap 6.40 ( http://nmap.org ) at 2022-02-06 06:08 UTC
Nmap scan report for 100.96.132.173
Host is up, received conn-refused (0.00048s latency).
PORT     STATE  SERVICE REASON       VERSION
8084/tcp closed unknown conn-refused

Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 0.09 seconds
  • What am I missing such that the GCS process did not start?
  • Do I need to be root or something to execute ray start --head ...?

I found the location of the logs just now and realized that GCS did indeed start on 8084 as requested.

bash-4.2$ cat /tmp/ray/session_latest/logs/gcs_server.out
[2022-02-06 06:07:22,358 I 293 293] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-02-06 06:07:22,358 I 293 293] gcs_redis_failure_detector.cc:30: Starting redis failure detector.
[2022-02-06 06:07:22,358 I 293 293] gcs_init_data.cc:42: Loading job table data.
[2022-02-06 06:07:22,358 I 293 293] gcs_init_data.cc:54: Loading node table data.
[2022-02-06 06:07:22,358 I 293 293] gcs_init_data.cc:66: Loading cluster resources table data.
[2022-02-06 06:07:22,358 I 293 293] gcs_init_data.cc:93: Loading actor table data.
[2022-02-06 06:07:22,358 I 293 293] gcs_init_data.cc:79: Loading placement group table data.
[2022-02-06 06:07:22,359 I 293 293] gcs_init_data.cc:46: Finished loading job table data, size = 0
[2022-02-06 06:07:22,359 I 293 293] gcs_init_data.cc:58: Finished loading node table data, size = 0
[2022-02-06 06:07:22,359 I 293 293] gcs_init_data.cc:70: Finished loading cluster resources table data, size = 0
[2022-02-06 06:07:22,359 I 293 293] gcs_init_data.cc:97: Finished loading actor table data, size = 0
[2022-02-06 06:07:22,359 I 293 293] gcs_init_data.cc:84: Finished loading placement group table data, size = 0
[2022-02-06 06:07:22,359 I 293 293] gcs_heartbeat_manager.cc:30: GcsHeartbeatManager start, num_heartbeats_timeout=30
[2022-02-06 06:07:22,374 I 293 293] grpc_server.cc:112: GcsServer server started, listening on port 8084.
[2022-02-06 06:07:22,384 I 293 293] gcs_server.cc:339: Gcs server address = 100.96.132.173:8084
[2022-02-06 06:07:22,384 I 293 293] gcs_server.cc:343: Finished setting gcs server address: 100.96.132.173:8084
[2022-02-06 06:07:22,384 I 293 293] gcs_server.cc:531: GcsNodeManager: {RegisterNode request count: 0, DrainNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, GetAllActorInfo request count: 0, KillActor request count: 0, ListNamedActors request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGrouprequest count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, GetNamedPlacementGroup request count: 0, Scheduling pending placement group count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
GrpcBasedResourceBroadcaster: {Tracked nodes: 0}
[2022-02-06 06:07:23,398 I 293 293] gcs_node_manager.cc:42: Registering node info, node id = d0ea1c5387f81ad72023fe9b12b24a9734b3ec95bb33220c5bac20d4, address = 100.96.132.173
[2022-02-06 06:07:23,398 I 293 293] gcs_node_manager.cc:47: Finished registering node info, node id = d0ea1c5387f81ad72023fe9b12b24a9734b3ec95bb33220c5bac20d4, address = 100.96.132.173
[2022-02-06 06:07:23,398 I 293 293] gcs_placement_group_manager.cc:722: A new node: d0ea1c5387f81ad72023fe9b12b24a9734b3ec95bb33220c5bac20d4 registered, will try to reschedule all the infeasible placement groups.
[2022-02-06 06:07:23,399 I 293 293] gcs_job_manager.cc:140: Getting all job info.
[2022-02-06 06:07:23,400 I 293 293] gcs_job_manager.cc:146: Finished getting all job info.

Similarly, the NodeManager also started, along with the Redis servers. I’m not sure now why I still see errors like

2022-02-05 16:39:16,665 ERROR gcs_utils.py:137 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1644079156.664823970","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1644079156.664822821","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

on the worker, even though I’m able to talk to a simple HTTP server on 8084 from the same pair of hosts.

Also cc @sangcho, since one thing we can improve here is the error messaging (i.e., report which address and port the request failed to reach).

Yeah, that would help. But in this case the "Failed to send request to gcs" message may not be due to the port being unreachable; I was at least able to curl a simple HTTP server hosted on 8084 (the same port as GCS). It is due to something else, which we need to figure out.
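One thing I can try to narrow this down further is probing the GCS gRPC endpoint directly from the worker, rather than plain TCP/HTTP (a rough sketch; 100.96.132.173:8084 is the GCS address reported in gcs_server.out above, and it assumes the grpcio package installed alongside Ray):

    # Check whether a gRPC channel to the head's GCS port becomes ready.
    import grpc

    channel = grpc.insecure_channel("100.96.132.173:8084")
    try:
        grpc.channel_ready_future(channel).result(timeout=5)
        print("gRPC channel to GCS became ready")
    except grpc.FutureTimeoutError:
        print("gRPC channel to GCS did not become ready within 5 seconds")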

I am facing the same issue but with the nightly build. After running the cluster for a while, it crashes inconsistently.

I am using 1.9.2 with Python 3.7.10 on all nodes.

@varunjammula for what version of Ray did it work for you earlier?

I’ve used 1.1.0 before without any problems on a different task, on Python 3.7.

Thanks for the hint to downgrade, @varunjammula.

I’m able to create a cluster manually using the above process on 1.1.0.

Ray team, let me know if you’d prefer that I raise a bug to track what’s going on in 1.9.2.

Useful fixes like https://github.com/ray-project/ray/issues/17534 and https://github.com/ray-project/ray/issues/20681 will be missed on Ray < 1.9, though.

This seems like a regression if it works on 1.1.0. Can you create an issue? Also, would it be possible to help bisect here? For example, does the same workflow work on 1.8?

What am I missing such that the GCS process did not start?

It only starts on the head node.

Do I need to be root or something to execute ray start --head ...?

It is not required.

Sure, I’ll create an issue. But why 1.8? Is there a specific change that went into 1.9 that you suspect broke the workers’ connection to GCS?

Oh, no particular reason. I just mentioned one of the versions between 1.1 and 1.9.

Btw, did you make sure all of the ports are open?
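(A rough way to check that from the worker node is a plain TCP connect against each of the ports from the earlier ray start command; the host and port list below are just the ones from this thread, and the worker ports in the 8090–8100 range will normally show as closed until worker processes actually start.)

    # TCP connect from the worker node to every port configured on the head node.
    import socket

    HEAD = "100.96.132.173"
    PORTS = [8080, 8081, 8082, 8083, 8084, 8085, 8086, 8087] + list(range(8090, 8101))

    for port in PORTS:
        try:
            with socket.create_connection((HEAD, port), timeout=3):
                state = "open"
        except OSError as exc:
            state = f"closed ({exc})"
        print(f"{HEAD}:{port} {state}")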

Btw, it also looks like the line that generates this error message was first introduced in 1.9.

So, if this works on 1.8, I guess this has something to do with the new connection logic. (Maybe it is not working well in your environment.)

Let me know how it goes! If it works on 1.8, we can start investigating the issue.

@Nitin_Pasumarthy just to make sure, is my understanding below correct?

  • On Ray 1.9 after running ray start --head ... --gcs-server-port=8084 ... on head node, host:8084 is not accessible from the worker node with curl / nmap -sV --reason -p $PORT $HOST
  • But if a simple webserver is started on 8084 on the head node, it is accessible from the worker node with curl / nmap -sV --reason -p $PORT $HOST

I think we can focus on debugging why GCS is not serving traffic on 8084 on the head node. Is the GCS process (ray/python/ray/core/src/ray/gcs/gcs_server) still alive after ray start --head ... --gcs-server-port=8084 ...? And if not, is there any error in the GCS log?
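For reference, something along these lines on the head node should answer both questions (the log directory is the one found earlier in this thread; the exact file names and log format can vary by Ray version, so treat the error scan as a rough heuristic):

    # Run on the head node: is gcs_server alive, and do its logs contain errors?
    import glob
    import subprocess

    # pgrep -af lists any process whose command line mentions gcs_server.
    proc = subprocess.run(["pgrep", "-af", "gcs_server"], capture_output=True, text=True)
    print(proc.stdout or "no gcs_server process found")

    # Crude scan for error-level lines (" E " in Ray's C++ log prefix).
    for path in sorted(glob.glob("/tmp/ray/session_latest/logs/gcs_server*")):
        with open(path, errors="replace") as f:
            error_lines = [line for line in f if " E " in line]
        print(f"{path}: {len(error_lines)} error-level lines")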

Bisecting which Ray version started to have issues will definitely help us too!

Sorry both of you for the delays on my end.

Min, your understanding is correct. However, I did not use nmap to test 8084’s visibility from the worker; I just curled some files successfully.

I’ll try 1.8 by the weekend and post my updates here. Thanks again for looking into this.