Hi, and thanks to everyone who helps out here.
I'm running Ray with the same settings on two machines, but this error only occurs on one of them. It happens while restoring and running the trained agent (loading the checkpoint for visualization). Does anyone know what the problem is? The error output is below.
Thank you.
INFO resource_spec.py:231 -- Starting Ray with 64.75 GiB memory available for workers and up to 31.74 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2021-12-29 19:41:05,898 INFO services.py:1193 -- View the Ray dashboard at localhost:8275
2021-12-29 19:41:06,964 WARNING deprecation.py:30 -- DeprecationWarning: `callbacks dict interface` has been deprecated. Use `a class extending rllib.agents.callbacks.DefaultCallbacks` instead. This will raise an error in the future!
2021-12-29 19:41:06,964 INFO trainer.py:632 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2021-12-29 19:41:06,966 WARNING deprecation.py:30 -- DeprecationWarning: `callbacks dict interface` has been deprecated. Use `a class extending rllib.agents.callbacks.DefaultCallbacks` instead. This will raise an error in the future!
2021-12-29 19:43:55,571 WARNING worker.py:1134 -- The node with node id 08582d0e0f48bc01c3526c38c2d002b11c52b1b4 has been marked dead because the detector has missed too many heartbeats from it.
(pid=raylet) F1229 19:43:57.603041 15766 15766 node_manager.cc:661] Check failed: node_id != self_node_id_ Exiting because this node manager has mistakenly been marked dead by the monitor.
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet) @ 0x563fd54fdd1d google::LogMessage::Fail()
(pid=raylet) @ 0x563fd54fee7c google::LogMessage::SendToLog()
(pid=raylet) @ 0x563fd54fd9f9 google::LogMessage::Flush()
(pid=raylet) @ 0x563fd54fdc11 google::LogMessage::~LogMessage()
(pid=raylet) @ 0x563fd54e90d9 ray::RayLog::~RayLog()
(pid=raylet) @ 0x563fd5201ef3 ray::raylet::NodeManager::NodeRemoved()
(pid=raylet) @ 0x563fd52021bc _ZNSt17_Function_handlerIFvRKN3ray8ClientIDERKNS0_3rpc11GcsNodeInfoEEZNS0_6raylet11NodeManager11RegisterGcsEvEUlS3_S7_E0_E9_M_invokeERKSt9_Any_dataS3_S7_
(pid=raylet) @ 0x563fd52e6550 ray::gcs::ServiceBasedNodeInfoAccessor::HandleNotification()
(pid=raylet) @ 0x563fd52e6826 _ZNSt17_Function_handlerIFvRKSsS1_EZZN3ray3gcs28ServiceBasedNodeInfoAccessor26AsyncSubscribeToNodeChangeERKSt8functionIFvRKNS3_8ClientIDERKNS3_3rpc11GcsNodeInfoEEERKS6_IFvNS3_6StatusEEEENKUlSM_E0_clESM_EUlS1_S1_E_E9_M_invokeERKSt9_Any_dataS1_S1_
(pid=raylet) @ 0x563fd52f0c4a _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray3gcs13CallbackReplyEEEZNS2_9GcsPubSub24ExecuteCommandIfPossibleERKSsRNS6_7ChannelEEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
(pid=raylet) @ 0x563fd52f262b _ZN5boost4asio6detail18completion_handlerIZN3ray3gcs20RedisCallbackManager12CallbackItem8DispatchERSt10shared_ptrINS4_13CallbackReplyEEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=raylet) @ 0x563fd57e401f boost::asio::detail::scheduler::do_run_one()
(pid=raylet) @ 0x563fd57e5521 boost::asio::detail::scheduler::run()
(pid=raylet) @ 0x563fd57e6552 boost::asio::io_context::run()
(pid=raylet) @ 0x563fd515a69e main
(pid=raylet) @ 0x7f4c65be8bf7 __libc_start_main
(pid=raylet) @ 0x563fd516c6b1 (unknown)
2021-12-29 19:47:15,051 INFO trainable.py:251 -- Trainable.setup took 368.088 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-12-29 19:47:15,051 WARNING util.py:37 -- Install gputil for GPU system monitoring.
Traceback (most recent call last):
File "visualizer_rllib.py", line 470, in <module>
visualizer_rllib(args)
File "visualizer_rllib.py", line 159, in visualizer_rllib
agent.restore(checkpoint)
File "/home/user/anaconda3/envs/flow/lib/python3.7/site-packages/ray/tune/trainable.py", line 467, in restore
self.load_checkpoint(checkpoint_path)
File "/home/user/anaconda3/envs/flow/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 685, in load_checkpoint
self.__setstate__(extra_data)
File "/home/user/anaconda3/envs/flow/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 125, in __setstate__
Trainer.__setstate__(self, state)
File "/home/user/anaconda3/envs/flow/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1185, in __setstate__
remote_state = ray.put(state["worker"])
File "/home/user/anaconda3/envs/flow/lib/python3.7/site-packages/ray/worker.py", line 1570, in put
object_ref = worker.put_object(value, pin_object=not weakref)
File "/home/user/anaconda3/envs/flow/lib/python3.7/site-packages/ray/worker.py", line 274, in put_object
pin_object=pin_object))
File "python/ray/_raylet.pyx", line 791, in ray._raylet.CoreWorker.put_serialized_object
File "python/ray/_raylet.pyx", line 759, in ray._raylet.CoreWorker._create_put_buffer
File "python/ray/_raylet.pyx", line 151, in ray._raylet.check_status
ray.exceptions.RayletError: The Raylet died with this message: Broken pipe
E1229 19:47:15.560415 15706 15706 raylet_client.cc:124] IOError: Broken pipe [RayletClient] Failed to disconnect from raylet.
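Since the log shows the node being marked dead after missed heartbeats while `Trainable.setup` took ~368 seconds, one workaround I'm considering is raising the heartbeat timeout when starting Ray. This is only a sketch: `num_heartbeats_timeout` is an internal `_system_config` key whose name and availability may differ across Ray versions, so please correct me if this is wrong for this release.

```python
import ray

# Sketch only: raise the number of missed heartbeats tolerated before the
# monitor marks a node dead. The key name `num_heartbeats_timeout` is an
# internal Ray setting and may vary between Ray versions.
ray.init(
    memory=64 * 1024 ** 3,              # worker memory, as in the log above
    object_store_memory=31 * 1024 ** 3, # object store memory
    _system_config={"num_heartbeats_timeout": 300},
)
```

If the slow machine is simply too loaded during setup, `reuse_actors=True` (as the log suggests) might also reduce the window in which heartbeats can be missed.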