How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Creating a local cluster with ray start --head, or from Python with import ray; ray.init(), fails with the (repeated) message:
ERROR node.py:605 -- Failed to connect to GCS. Please check gcs_server.out for more details.
The system is running Ubuntu 22.04, Python 3.10.6, and Ray 2.6.1. Any suggestions on how to resolve the issue, or on what additional information would be useful, would be greatly appreciated!
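In case it helps with triage, here is a minimal sketch for pulling only the non-informational lines out of gcs_server.out. It assumes each log record starts with a bracketed header of the form [timestamp LEVEL pid tid], as in the log below. Only the letters I and W actually appear in my log; treating E, F and C as error-like levels is an assumption, and problem_lines is a hypothetical helper name, not part of Ray.

```python
import re

# Header of a record in gcs_server.out, e.g.
# [2023-07-24 18:16:57,738 W 44126 44130] (gcs_server) metric_exporter.cc:212: ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]) \d+ \d+\] (?P<rest>.*)$"
)


def problem_lines(lines, levels=("W", "E", "F", "C")):
    """Return (timestamp, level, rest) for records at the given severities.

    Continuation lines without a bracketed header (such as the
    "- RegisterNode request count: 0" dump lines) are skipped.
    The letters other than "W" are assumed, not observed in the log.
    """
    found = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            found.append(
                (match.group("ts"), match.group("level"), match.group("rest"))
            )
    return found


# Two records and one continuation line copied from the log in this post.
sample = [
    "[2023-07-24 18:16:47,722 I 44126 44126] (gcs_server) event.cc:342: "
    "Ray Event initialized for GCS",
    "- RegisterNode request count: 0",
    "[2023-07-24 18:16:57,738 W 44126 44130] (gcs_server) "
    "metric_exporter.cc:212: [1] Export metrics to agent failed: "
    "GrpcUnavailable: RPC Error message: failed to connect to all addresses; "
    "RPC Error details: . This won't affect Ray, but you can lose metrics "
    "from the cluster.",
]

for ts, level, rest in problem_lines(sample):
    print(ts, level, rest[:60])
```

To run it against the real file, pass an open file object instead of sample, for example problem_lines(open(path)), where path is wherever gcs_server.out lives on your machine.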
Contents of gcs_server.out:
[2023-07-24 18:16:47,722 I 44126 44126] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-07-24 18:16:47,722 I 44126 44126] (gcs_server) event.cc:234: Set ray event level to warning
[2023-07-24 18:16:47,722 I 44126 44126] (gcs_server) event.cc:342: Ray Event initialized for GCS
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_server.cc:74: GCS storage type is StorageType::IN_MEMORY
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:108: Loading actor task spec table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:81: Loading placement group table data.
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:48: Finished loading job table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:60: Finished loading node table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:72: Finished loading cluster resources table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:99: Finished loading actor table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:112: Finished loading actor task spec table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_init_data.cc:86: Finished loading placement group table data, size = 0
[2023-07-24 18:16:47,723 I 44126 44126] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: 9481472b4a9f4771b5910cb1db92f98072aab2ae6613e192b8925e48
[2023-07-24 18:16:47,724 I 44126 44126] (gcs_server) grpc_server.cc:129: GcsServer server started, listening on port 65178.
[2023-07-24 18:16:47,751 I 44126 44126] (gcs_server) gcs_server.cc:255: GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0
GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0
GcsResourceManager:
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0
GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0
GcsPublisher {}
[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:
GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0
[2023-07-24 18:16:47,751 I 44126 44126] (gcs_server) gcs_server.cc:844: Event stats:
Global stats: 28 total (16 active)
Queueing time: mean = 1.978 ms, max = 27.643 ms, min = 756.000 ns, total = 55.382 ms
Execution time: mean = 988.495 us, total = 27.678 ms
Event stats:
InternalKVGcsService.grpc_server.InternalKVPut - 6 total (5 active), CPU time: mean = 792.833 ns, total = 4.757 us
GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 2.580 us, total = 15.479 us
InternalKVGcsService.grpc_client.InternalKVPut - 6 total (6 active), CPU time: mean = 0.000 s, total = 0.000 s
PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), CPU time: mean = 612.250 ns, total = 2.449 us
GcsInMemoryStore.Put - 3 total (1 active), CPU time: mean = 9.217 ms, total = 27.651 ms
UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
RayletLoadPulled - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 4.293 us, total = 4.293 us
[2023-07-24 18:16:47,751 I 44126 44126] (gcs_server) gcs_server.cc:845: GcsTaskManager Event stats:
Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time: mean = -nan s, total = 0.000 s
Event stats:
[2023-07-24 18:16:57,738 W 44126 44130] (gcs_server) metric_exporter.cc:212: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-07-24 18:17:47,751 I 44126 44126] (gcs_server) gcs_server.cc:255: GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0
GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0
GcsResourceManager:
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0
GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0
GcsPublisher {}
[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:
GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0
[2023-07-24 18:17:47,751 I 44126 44126] (gcs_server) gcs_server.cc:844: Event stats:
Global stats: 316 total (4 active)
Queueing time: mean = 261.132 us, max = 27.643 ms, min = 756.000 ns, total = 82.518 ms
Execution time: mean = 109.699 us, total = 34.665 ms
Event stats:
GcsInMemoryStore.Put - 74 total (0 active), CPU time: mean = 397.753 us, total = 29.434 ms
InternalKVGcsService.grpc_server.InternalKVPut - 72 total (0 active), CPU time: mean = 15.612 us, total = 1.124 ms
InternalKVGcsService.grpc_client.InternalKVPut - 72 total (0 active), CPU time: mean = 16.162 us, total = 1.164 ms
RayletLoadPulled - 60 total (1 active), CPU time: mean = 5.341 us, total = 320.487 us
UNKNOWN - 20 total (1 active), CPU time: mean = 6.829 us, total = 136.574 us
GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 2.580 us, total = 15.479 us
GCSServer.deadline_timer.debug_state_dump - 6 total (1 active), CPU time: mean = 389.623 us, total = 2.338 ms
PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 32.203 us, total = 128.810 us
GCSServer.deadline_timer.debug_state_event_stats_print - 1 total (1 active, 1 running), CPU time: mean = 0.000 s, total = 0.000 s
GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 4.293 us, total = 4.293 us
[2023-07-24 18:17:47,752 I 44126 44126] (gcs_server) gcs_server.cc:845: GcsTaskManager Event stats:
Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time: mean = -nan s, total = 0.000 s
Event stats:
[2023-07-24 18:18:47,752 I 44126 44126] (gcs_server) gcs_server.cc:255: GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0
GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0
GcsResourceManager:
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0
GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0
GcsPublisher {}
[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:
GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0
[2023-07-24 18:18:47,752 I 44126 44126] (gcs_server) gcs_server.cc:844: Event stats:
Global stats: 619 total (4 active)
Queueing time: mean = 180.491 us, max = 27.643 ms, min = 756.000 ns, total = 111.724 ms
Execution time: mean = 68.240 us, total = 42.241 ms
Event stats:
GcsInMemoryStore.Put - 146 total (0 active), CPU time: mean = 212.025 us, total = 30.956 ms
InternalKVGcsService.grpc_server.InternalKVPut - 144 total (0 active), CPU time: mean = 16.687 us, total = 2.403 ms
InternalKVGcsService.grpc_client.InternalKVPut - 144 total (0 active), CPU time: mean = 15.482 us, total = 2.229 ms
RayletLoadPulled - 120 total (1 active), CPU time: mean = 5.350 us, total = 641.943 us
UNKNOWN - 40 total (1 active), CPU time: mean = 6.950 us, total = 277.984 us
GCSServer.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 424.841 us, total = 5.098 ms
GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 2.580 us, total = 15.479 us
PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 32.203 us, total = 128.810 us
GCSServer.deadline_timer.debug_state_event_stats_print - 2 total (1 active, 1 running), CPU time: mean = 243.068 us, total = 486.136 us
GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 4.293 us, total = 4.293 us
[2023-07-24 18:18:47,752 I 44126 44126] (gcs_server) gcs_server.cc:845: GcsTaskManager Event stats:
Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time: mean = -nan s, total = 0.000 s
Event stats:
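One check that might narrow this down: the log reports the GCS server listening on port 65178, yet the client side fails to connect and every RegisterNode count stays at 0. The sketch below tests whether a TCP port accepts connections, using only the Python standard library. The host 127.0.0.1 is an assumption; Ray may try to reach the GCS through the machine's non-loopback address, so it could be worth trying that address too. port_accepts_connections is a hypothetical helper name, and the port changes on every start, so 65178 only applies to this particular run.

```python
import socket


def port_accepts_connections(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Port taken from the "listening on port 65178" line in the log above;
    # the loopback host is an assumption.
    print(port_accepts_connections("127.0.0.1", 65178))
```

If this succeeds on the loopback address but fails on the machine's other address while the GCS process is still running, that would point towards a firewall, proxy or address-resolution problem rather than a crash in the GCS itself.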