Raylet exits abnormally when setting up a local Ray Cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m trying to set up a local Ray cluster, but a few seconds after starting the head node, it reports an error and exits. Please help me…

Python: 3.9.2
Ray: 2.2.0

2023-04-12 15:34:59,523 INFO event_agent.py:56 -- Report events to 10.10.0.155:42363
2023-04-12 15:34:59,523 INFO event_utils.py:131 -- Monitor events logs modified after 1681283099.4583993 on /tmp/ray/session_2023-04-12_15-34-57_074130_1345553/logs/events, the source types are all.
2023-04-12 15:35:10,094 WARNING agent.py:196 -- Raylet is considered dead 1 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:10,496 WARNING agent.py:196 -- Raylet is considered dead 2 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:10,897 WARNING agent.py:196 -- Raylet is considered dead 3 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:11,299 WARNING agent.py:196 -- Raylet is considered dead 4 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:11,701 WARNING agent.py:196 -- Raylet is considered dead 5 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:11,702 ERROR agent.py:249 -- Raylet is terminated: ip=10.10.0.155, id=6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] Queueing time: mean = 3.965 ms, max = 27.793 ms, min = 22.135 us, total = 87.235 ms
    [state-dump] Execution time:  mean = 1.542 ms, total = 33.925 ms
    [state-dump] Event stats:
    [state-dump]        PeriodicalRunner.RunFnPeriodically - 9 total (1 active, 1 running), CPU time: mean = 438.626 us, total = 3.948 ms
    [state-dump]        UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeManagerService.grpc_server.RequestResourceReport - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 29.451 ms, total = 29.451 ms
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 526.610 us, total = 526.610 us
    [state-dump] DebugString() time ms: 0
    [state-dump]
    [state-dump]
    [2023-04-12 15:34:58,986 I 1345751 1345751] (raylet) accessor.cc:612: Received notification for node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, IsAlive = 1
    [2023-04-12 15:34:59,522 I 1345751 1345751] (raylet) agent_manager.cc:40: HandleRegisterAgent, ip: 10.10.0.155, port: 50379, id: 424238335

gcs_server.out

[2023-04-12 15:34:58,980 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:42: Registering node info, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, address = 10.10.0.155, node name = 10.10.0.155
[2023-04-12 15:34:58,980 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:48: Finished registering node info, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, address = 10.10.0.155, node name = 10.10.0.155
[2023-04-12 15:34:58,980 I 1345562 1345562] (gcs_server) gcs_placement_group_manager.cc:760: A new node: 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1 registered, will try to reschedule all the infeasible placement groups.
[2023-04-12 15:34:58,986 I 1345562 1345562] (gcs_server) gcs_job_manager.cc:149: Getting all job info.
[2023-04-12 15:34:58,986 I 1345562 1345562] (gcs_server) gcs_job_manager.cc:155: Finished getting all job info.
[2023-04-12 15:35:11,778 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:79: Draining node info, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1
[2023-04-12 15:35:11,778 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:212: Removing node, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, node name = 10.10.0.155
[2023-04-12 15:35:11,778 I 1345562 1345562] (gcs_server) gcs_placement_group_manager.cc:732: Node 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1 failed, rescheduling the placement groups on the dead node.
[2023-04-12 15:35:11,778 I 1345562 1345562] (gcs_server) gcs_actor_manager.cc:989: Node 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1 failed, reconstructing actors.
[2023-04-12 15:35:11,842 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:134: Raylet 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1 is drained. Status GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: . The information will be published to the cluster.
[2023-04-12 15:35:17,304 W 1345562 1345575] (gcs_server) metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.

raylet.out

[2023-04-12 15:34:58,986 I 1345751 1345751] (raylet) accessor.cc:612: Received notification for node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, IsAlive = 1
[2023-04-12 15:34:59,522 I 1345751 1345751] (raylet) agent_manager.cc:40: HandleRegisterAgent, ip: 10.10.0.155, port: 50379, id: 424238335
[2023-04-12 15:35:11,777 I 1345751 1345830] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, exit code 0. ip 10.10.0.155. id 424238335
[2023-04-12 15:35:11,777 E 1345751 1345830] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
[2023-04-12 15:35:11,777 I 1345751 1345751] (raylet) main.cc:301: Raylet received SIGTERM, shutting down...
[2023-04-12 15:35:11,777 I 1345751 1345751] (raylet) accessor.cc:435: Unregistering node info, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1
[2023-04-12 15:35:11,778 I 1345751 1345751] (raylet) io_service_pool.cc:47: IOServicePool is stopped.
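
Since the raylet error above says to see `dashboard_agent.log` for the root cause, one way to pull out the relevant lines is sketched below. This assumes the default /tmp/ray temp dir and the session_latest symlink Ray maintains; adjust the path if Ray was started with a custom --temp-dir.

    from pathlib import Path

    # Path to the dashboard agent log of the most recent Ray session
    # (assumption: default temp dir, no custom --temp-dir).
    log_path = Path("/tmp/ray/session_latest/logs/dashboard_agent.log")

    # Print the last 50 lines; the agent's failure reason usually shows up here.
    lines = log_path.read_text().splitlines()
    print("\n".join(lines[-50:]))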

I have the same issue with 2.3.0-1. I don’t have a clue what’s going on.
I’ve tried multiple versions of Python, gRPC, etc.

Are you sending jobs to the cluster, or just running code on the head node?
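
For reference, a minimal sketch of the “running code on the head node” case, assuming a head node is already running on the local machine (the ping task is only illustrative). Submitting work from another machine would instead go through the `ray job submit` CLI against the head node’s dashboard address.

    import ray

    # Connect to the already-running local cluster rather than starting a new one.
    ray.init(address="auto")

    @ray.remote
    def ping():
        return "ok"

    # A trivial remote task just to confirm the raylet accepts work.
    print(ray.get(ping.remote()))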