Local Cluster - Failed to connect to GCS

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Creating a local cluster, either with ray start --head or from Python with import ray; ray.init(), fails with the following (repeated) message:

ERROR node.py:605 – Failed to connect to GCS. Please check gcs_server.out for more details.

The system is running Ubuntu 22.04, Python 3.10.6, and Ray 2.6.1. If you have any suggestions on how to resolve the issue or what additional information would be useful, it would be greatly appreciated!

Contents of gcs_server.out:

[2023-07-24 18:16:47,722 I 44126 44126] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-07-24 18:16:47,722 I 44126 44126] (gcs_server) event.cc:234: Set ray event level to warning
[2023-07-24 18:16:47,722 I 44126 44126] (gcs_server) event.cc:342: Ray Event initialized for GCS
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_server.cc:74: GCS storage type is StorageType::IN_MEMORY
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:108: Loading actor task spec table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:81: Loading placement group table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:48: Finished loading job table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:60: Finished loading node table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:72: Finished loading cluster resources table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:99: Finished loading actor table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:112: Finished loading actor task spec table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:86: Finished loading placement group table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: 9481472b4a9f4771b5910cb1db92f98072aab2ae6613e192b8925e48
[2023-07-24 18:16:47,724 I 44126 44126] (gcs_server) grpc_server.cc:129: GcsServer server started, listening on port 65178.
[2023-07-24 18:16:47,751 I 44126 44126] (gcs_server) gcs_server.cc:255: GcsNodeManager: 
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager: 
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0


[2023-07-24 18:16:47,751 I 44126 44126] (gcs_server) gcs_server.cc:844: Event stats:


Global stats: 28 total (16 active)
Queueing time: mean = 1.978 ms, max = 27.643 ms, min = 756.000 ns, total = 55.382 ms
Execution time:  mean = 988.495 us, total = 27.678 ms
Event stats:
	InternalKVGcsService.grpc_server.InternalKVPut - 6 total (5 active), CPU time: mean = 792.833 ns, total = 4.757 us
	GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 2.580 us, total = 15.479 us
	InternalKVGcsService.grpc_client.InternalKVPut - 6 total (6 active), CPU time: mean = 0.000 s, total = 0.000 s
	PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), CPU time: mean = 612.250 ns, total = 2.449 us
	GcsInMemoryStore.Put - 3 total (1 active), CPU time: mean = 9.217 ms, total = 27.651 ms
	UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	RayletLoadPulled - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 4.293 us, total = 4.293 us


[2023-07-24 18:16:47,751 I 44126 44126] (gcs_server) gcs_server.cc:845: GcsTaskManager Event stats:


Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time:  mean = -nan s, total = 0.000 s
Event stats:


[2023-07-24 18:16:57,738 W 44126 44130] (gcs_server) metric_exporter.cc:212: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-07-24 18:17:47,751 I 44126 44126] (gcs_server) gcs_server.cc:255: GcsNodeManager: 
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager: 
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0


[2023-07-24 18:17:47,751 I 44126 44126] (gcs_server) gcs_server.cc:844: Event stats:


Global stats: 316 total (4 active)
Queueing time: mean = 261.132 us, max = 27.643 ms, min = 756.000 ns, total = 82.518 ms
Execution time:  mean = 109.699 us, total = 34.665 ms
Event stats:
	GcsInMemoryStore.Put - 74 total (0 active), CPU time: mean = 397.753 us, total = 29.434 ms
	InternalKVGcsService.grpc_server.InternalKVPut - 72 total (0 active), CPU time: mean = 15.612 us, total = 1.124 ms
	InternalKVGcsService.grpc_client.InternalKVPut - 72 total (0 active), CPU time: mean = 16.162 us, total = 1.164 ms
	RayletLoadPulled - 60 total (1 active), CPU time: mean = 5.341 us, total = 320.487 us
	UNKNOWN - 20 total (1 active), CPU time: mean = 6.829 us, total = 136.574 us
	GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 2.580 us, total = 15.479 us
	GCSServer.deadline_timer.debug_state_dump - 6 total (1 active), CPU time: mean = 389.623 us, total = 2.338 ms
	PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 32.203 us, total = 128.810 us
	GCSServer.deadline_timer.debug_state_event_stats_print - 1 total (1 active, 1 running), CPU time: mean = 0.000 s, total = 0.000 s
	GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 4.293 us, total = 4.293 us


[2023-07-24 18:17:47,752 I 44126 44126] (gcs_server) gcs_server.cc:845: GcsTaskManager Event stats:


Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time:  mean = -nan s, total = 0.000 s
Event stats:


[2023-07-24 18:18:47,752 I 44126 44126] (gcs_server) gcs_server.cc:255: GcsNodeManager: 
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager: 
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0


[2023-07-24 18:18:47,752 I 44126 44126] (gcs_server) gcs_server.cc:844: Event stats:


Global stats: 619 total (4 active)
Queueing time: mean = 180.491 us, max = 27.643 ms, min = 756.000 ns, total = 111.724 ms
Execution time:  mean = 68.240 us, total = 42.241 ms
Event stats:
	GcsInMemoryStore.Put - 146 total (0 active), CPU time: mean = 212.025 us, total = 30.956 ms
	InternalKVGcsService.grpc_server.InternalKVPut - 144 total (0 active), CPU time: mean = 16.687 us, total = 2.403 ms
	InternalKVGcsService.grpc_client.InternalKVPut - 144 total (0 active), CPU time: mean = 15.482 us, total = 2.229 ms
	RayletLoadPulled - 120 total (1 active), CPU time: mean = 5.350 us, total = 641.943 us
	UNKNOWN - 40 total (1 active), CPU time: mean = 6.950 us, total = 277.984 us
	GCSServer.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 424.841 us, total = 5.098 ms
	GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 2.580 us, total = 15.479 us
	PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 32.203 us, total = 128.810 us
	GCSServer.deadline_timer.debug_state_event_stats_print - 2 total (1 active, 1 running), CPU time: mean = 243.068 us, total = 486.136 us
	GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 4.293 us, total = 4.293 us


[2023-07-24 18:18:47,752 I 44126 44126] (gcs_server) gcs_server.cc:845: GcsTaskManager Event stats:


Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time:  mean = -nan s, total = 0.000 s
Event stats:
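
For anyone skimming the dump above: the line most relevant to a connection failure is probably the "listening on port 65178" entry, since it shows the GCS server did start. Below is a minimal Python sketch that pulls that port out of the log text and checks whether a plain TCP connection to a given host and port succeeds. The helper names gcs_port_from_log and can_connect are made up for illustration and are not part of Ray. Which host to probe is left to the reader, because the log does not say which address the node tried to use.

```python
import re
import socket

# Line copied from the gcs_server.out excerpt above.
LOG_LINE = (
    "[2023-07-24 18:16:47,724 I 44126 44126] (gcs_server) "
    "grpc_server.cc:129: GcsServer server started, listening on port 65178."
)


def gcs_port_from_log(text):
    """Return the port from the first 'listening on port N' entry, or None."""
    match = re.search(r"listening on port (\d+)", text)
    return int(match.group(1)) if match else None


def can_connect(host, port, timeout=2.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


port = gcs_port_from_log(LOG_LINE)
print(port)
```

While the head node is still trying to start, calling can_connect(host, port) once with 127.0.0.1 and once with the machine's network address would show whether only one of the two paths is blocked. This only tests TCP reachability; it says nothing about whether the gRPC handshake itself would succeed.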



After more investigation, this seems to be firewall related. Specifically, this iptables rule in the OUTPUT chain may be the problem:
DROP all -- anywhere 10.0.0.0/8

I don’t have much networking experience, so I’m not sure whether this is expected behavior or an oddity of Ray. Is there a way around this, or do the firewall rules need to be modified?
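
To make the effect of that rule concrete: it drops outbound packets whose destination falls inside 10.0.0.0/8. The sketch below uses Python's standard ipaddress module to test which addresses the rule would cover. The machine's real address is not given in this thread, so 10.1.2.3 is a hypothetical stand-in, and is_blocked is an illustrative helper name.

```python
import ipaddress

# Destination network from the DROP rule quoted above.
BLOCKED_NET = ipaddress.ip_network("10.0.0.0/8")


def is_blocked(addr):
    """Return True if addr is a destination the DROP rule would match."""
    return ipaddress.ip_address(addr) in BLOCKED_NET


# 10.1.2.3 is a hypothetical node address; the loopback ones are from the thread.
for candidate in ("10.1.2.3", "127.0.1.1", "127.0.0.1"):
    print(candidate, is_blocked(candidate))
```

If the node's own address is inside the blocked range, any local process that connects to the GCS through that address would have its packets dropped, even though both ends are on the same machine. Loopback addresses fall outside the range, so traffic to them is not matched by this particular rule.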

I’ve solved the problem, though I don’t know enough to understand why this works. I had to set --node-ip-address=127.0.1.1 (which has a separate entry in /etc/hosts). Setting --node-ip-address=localhost or --node-ip-address=127.0.0.1 did not work, as Ray would just continue to use the internet IP address.
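
For reference, here is a small Python sketch of the workaround as a reusable command builder. The flag and the value 127.0.1.1 come from the post above. The helper name ray_local_head_command and the loopback check are my own additions for illustration, not anything Ray provides.

```python
import ipaddress
import shlex


def ray_local_head_command(node_ip="127.0.1.1"):
    """Build the 'ray start' command for a head node pinned to node_ip."""
    # Raises ValueError for host names such as "localhost"; only IP literals pass.
    ip = ipaddress.ip_address(node_ip)
    # Extra guard (not from the thread): keep this helper to loopback addresses.
    if not ip.is_loopback:
        raise ValueError(f"{node_ip} is not a loopback address")
    return ["ray", "start", "--head", f"--node-ip-address={ip}"]


print(shlex.join(ray_local_head_command()))
```

The returned list can be passed to subprocess.run on a machine where Ray is installed. Note that the check also accepts 127.0.0.1, which the post reports did not work in this setup, so the default of 127.0.1.1 is the only value the thread confirms.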

I think that if you pass localhost, Ray automatically translates it to the machine’s IP address. Maybe when you specify the loopback address 127.0.1.1 (the second loopback address), Ray couldn’t detect it and fell back to using localhost.
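
How Ray actually picks its node address is not shown in this thread, so treat the following as a generic illustration and not as Ray's implementation. A common way for a program to find the machine's outward-facing address is to "connect" a UDP socket to some external host and read back the local address the kernel selected. The function name outbound_ip and the probe target 8.8.8.8:80 are assumptions chosen for the sketch.

```python
import socket


def outbound_ip(probe_host="8.8.8.8", probe_port=80):
    """Return the local address the kernel would use to reach probe_host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only selects a route.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, no network): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


print(outbound_ip())
```

A program that derives its address this way would report the network interface's address regardless of what localhost resolves to. That would fit the observation that Ray kept using the internet IP address, but it remains a guess.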