Raylet exits abnormally when setting up a local Ray Cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m trying to set up a local Ray cluster, but a few seconds after starting the head node, it reports an error and exits. Please help me…

Python: 3.9.2
Ray: 2.2.0

2023-04-12 15:34:59,523 INFO event_agent.py:56 -- Report events to 10.10.0.155:42363
2023-04-12 15:34:59,523 INFO event_utils.py:131 -- Monitor events logs modified after 1681283099.4583993 on /tmp/ray/session_2023-04-12_15-34-57_074130_1345553/logs/events, the source types are all.
2023-04-12 15:35:10,094 WARNING agent.py:196 -- Raylet is considered dead 1 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:10,496 WARNING agent.py:196 -- Raylet is considered dead 2 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:10,897 WARNING agent.py:196 -- Raylet is considered dead 3 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:11,299 WARNING agent.py:196 -- Raylet is considered dead 4 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:11,701 WARNING agent.py:196 -- Raylet is considered dead 5 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-04-12 15:35:11,702 ERROR agent.py:249 -- Raylet is terminated: ip=10.10.0.155, id=6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] Queueing time: mean = 3.965 ms, max = 27.793 ms, min = 22.135 us, total = 87.235 ms
    [state-dump] Execution time:  mean = 1.542 ms, total = 33.925 ms
    [state-dump] Event stats:
    [state-dump]        PeriodicalRunner.RunFnPeriodically - 9 total (1 active, 1 running), CPU time: mean = 438.626 us, total = 3.948 ms
    [state-dump]        UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeManagerService.grpc_server.RequestResourceReport - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump]        NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 29.451 ms, total = 29.451 ms
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 526.610 us, total = 526.610 us
    [state-dump] DebugString() time ms: 0
    [state-dump]
    [state-dump]
    [2023-04-12 15:34:58,986 I 1345751 1345751] (raylet) accessor.cc:612: Received notification for node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, IsAlive = 1
    [2023-04-12 15:34:59,522 I 1345751 1345751] (raylet) agent_manager.cc:40: HandleRegisterAgent, ip: 10.10.0.155, port: 50379, id: 424238335

gcs_server.out

[2023-04-12 15:34:58,980 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:42: Registering node info, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, address = 10.10.0.155, node name = 10.10.0.155
[2023-04-12 15:34:58,980 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:48: Finished registering node info, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, address = 10.10.0.155, node name = 10.10.0.155
[2023-04-12 15:34:58,980 I 1345562 1345562] (gcs_server) gcs_placement_group_manager.cc:760: A new node: 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1 registered, will try to reschedule all the infeasible placement groups.
[2023-04-12 15:34:58,986 I 1345562 1345562] (gcs_server) gcs_job_manager.cc:149: Getting all job info.
[2023-04-12 15:34:58,986 I 1345562 1345562] (gcs_server) gcs_job_manager.cc:155: Finished getting all job info.
[2023-04-12 15:35:11,778 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:79: Draining node info, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1
[2023-04-12 15:35:11,778 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:212: Removing node, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, node name = 10.10.0.155
[2023-04-12 15:35:11,778 I 1345562 1345562] (gcs_server) gcs_placement_group_manager.cc:732: Node 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1 failed, rescheduling the placement groups on the dead node.
[2023-04-12 15:35:11,778 I 1345562 1345562] (gcs_server) gcs_actor_manager.cc:989: Node 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1 failed, reconstructing actors.
[2023-04-12 15:35:11,842 I 1345562 1345562] (gcs_server) gcs_node_manager.cc:134: Raylet 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1 is drained. Status GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: . The information will be published to the cluster.
[2023-04-12 15:35:17,304 W 1345562 1345575] (gcs_server) metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.

raylet.out

[2023-04-12 15:34:58,986 I 1345751 1345751] (raylet) accessor.cc:612: Received notification for node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1, IsAlive = 1
[2023-04-12 15:34:59,522 I 1345751 1345751] (raylet) agent_manager.cc:40: HandleRegisterAgent, ip: 10.10.0.155, port: 50379, id: 424238335
[2023-04-12 15:35:11,777 I 1345751 1345830] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, exit code 0. ip 10.10.0.155. id 424238335
[2023-04-12 15:35:11,777 E 1345751 1345830] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
[2023-04-12 15:35:11,777 I 1345751 1345751] (raylet) main.cc:301: Raylet received SIGTERM, shutting down...
[2023-04-12 15:35:11,777 I 1345751 1345751] (raylet) accessor.cc:435: Unregistering node info, node id = 6a1f6ffd518c466e3741714fb810f5dd656f7a34d9d7b81994f6f7d1
[2023-04-12 15:35:11,778 I 1345751 1345751] (raylet) io_service_pool.cc:47: IOServicePool is stopped.
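
Since the raylet error above says to see `dashboard_agent.log` for the root cause, one way to pull out the relevant lines is sketched below. This assumes the default /tmp/ray temp dir and the session_latest symlink Ray maintains; adjust the path if Ray was started with a custom --temp-dir.

    from pathlib import Path

    # Path to the dashboard agent log of the most recent Ray session
    # (assumption: default temp dir, no custom --temp-dir).
    log_path = Path("/tmp/ray/session_latest/logs/dashboard_agent.log")

    # Print the last 50 lines; the agent's failure reason usually shows up here.
    lines = log_path.read_text().splitlines()
    print("\n".join(lines[-50:]))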

I have the same issue with 2.3.0-1. I don’t have a clue what’s going on.
I’ve tried multiple versions of Python, gRPC, etc.

Are you sending jobs to the cluster, or just running code on the head node?
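
For reference, a minimal sketch of the “running code on the head node” case, assuming a head node is already running on the local machine (the ping task is only illustrative). Submitting work from another machine would instead go through the `ray job submit` CLI against the head node’s dashboard address.

    import ray

    # Connect to the already-running local cluster rather than starting a new one.
    ray.init(address="auto")

    @ray.remote
    def ping():
        return "ok"

    # A trivial remote task just to confirm the raylet accepts work.
    print(ray.get(ping.remote()))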