How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi! I have a software pipeline consisting of ML models and multiple other actors for I/O and other processing. Recently, I encountered an error while trying to stop a job on my cluster (Check failed: addr_proto.worker_id() != ""). To stop a job, I run a script that unsubscribes and kills the actors belonging to that job (rough sketch below).
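For context, here is roughly what the stop script does. This is a simplified sketch, not my exact production code: the actor names and the namespace below are placeholders, and the real script looks up which actors are registered for the job being stopped. Each pipeline actor exposes a kill_actor() method that cleans up and exits the actor, which is the camActor.kill_actor / logActor.kill_actor task you can see in the logs.

```python
import ray

# Simplified sketch of the job-stop script (actor names and the namespace
# are placeholders; the real script tracks which actors belong to which job).
ray.init(address="auto", namespace="pipeline")

def stop_job(actor_names):
    for name in actor_names:
        try:
            actor = ray.get_actor(name)   # look up the named actor
            actor.kill_actor.remote()     # ask it to clean up and exit itself
        except ValueError:
            continue                      # actor is already gone
    ray.shutdown()                        # disconnect the driver

stop_job(["camActor", "logActor"])
```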
This is the Ray log from the failed job stop:
[2024-06-30 12:44:12,759 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:237: Connecting to actor af97e61451879ea91b41323003000000 at worker 6b46c89f7e155c506f74a0dfadc32b7b6a38b08d9c92290083e8401a
[2024-06-30 12:44:12,760 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) actor_manager.cc:214: received notification on actor, state: ALIVE, actor_id: 2de6bbed4482d002071a15c205000000, ip address: 172.16.120.98, port: 10053, worker_id: 1288e554cf5b193db2ab5f2804861973bac53f9bf463596963f6fbe3, raylet_id: 35e2c8771ce370fc4951a3be6ec96fe8d04278e12cfd062c7147b46a, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2024-06-30 12:44:12,760 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:237: Connecting to actor 2de6bbed4482d002071a15c205000000 at worker 1288e554cf5b193db2ab5f2804861973bac53f9bf463596963f6fbe3
[2024-06-30 12:44:12,792 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:825: task 387079a42b5466352de6bbed4482d002071a15c205000000 retries left: 0, oom retries left: 0, task failed due to oom: 0
[2024-06-30 12:44:12,792 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:841: No retries left for task 387079a42b5466352de6bbed4482d002071a15c205000000, not going to resubmit.
[2024-06-30 12:44:12,792 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:563: PushActorTask failed because of network error, this task will be stashed away and waiting for Death info from GCS, task_id=387079a42b5466352de6bbed4482d002071a15c205000000, wait_queue_size=1
(raylet) [2024-06-30 12:44:12,815 C 260 260] (raylet) core_worker_client_pool.cc:32: Check failed: addr_proto.worker_id() != ""
(raylet) *** StackTrace Information ***
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x56187a) [0x56388038b87a] ray::operator<<()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x563262) [0x56388038d262] ray::SpdLogMessage::Flush()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x563577) [0x56388038d577] ray::RayLog::~RayLog()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x48f204) [0x5638802b9204] ray::rpc::CoreWorkerClientPool::GetOrConnect()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x268e70) [0x563880092e70] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x496810) [0x5638802c0810] ray::pubsub::Subscriber::SendCommandBatchIfPossible()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x497af9) [0x5638802c1af9] ray::pubsub::Subscriber::SubscribeInternal()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x497f7a) [0x5638802c1f7a] ray::pubsub::Subscriber::Subscribe()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x33af88) [0x563880164f88] ray::OwnershipBasedObjectDirectory::SubscribeObjectLocations()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x308c90) [0x563880132c90] ray::ObjectManager::Pull()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x265e55) [0x56388008fe55] ray::raylet::DependencyManager::StartOrUpdateGetRequest()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x2ac546) [0x5638800d6546] ray::raylet::NodeManager::ProcessFetchOrReconstructMessage()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x2bbfa2) [0x5638800e5fa2] ray::raylet::NodeManager::ProcessClientMessage()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x203641) [0x56388002d641] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x50075d) [0x56388032a75d] ray::ClientConnection::ProcessMessage()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x541336) [0x56388036b336] EventTracker::RecordExecution()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x4f7ab2) [0x563880321ab2] boost::asio::detail::binder2<>::operator()()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x4f8208) [0x563880322208] boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac4c1b) [0x5638808eec1b] boost::asio::detail::scheduler::do_run_one()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac71a9) [0x5638808f11a9] boost::asio::detail::scheduler::run()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac7662) [0x5638808f1662] boost::asio::io_context::run()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x17350a) [0x56387ff9d50a] main
(raylet) /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fa59bd3b083] __libc_start_main
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x1bcb17) [0x56387ffe6b17]
For reference, this is what the log for a successful job stop looks like:
[2024-06-30 12:40:22,096 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:825: task c78d4c02847876b0793a0b4524f7c89146b5460e03000000 retries left: 0, oom retries left: 0, task failed due to oom: 0
[2024-06-30 12:40:22,096 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:841: No retries left for task c78d4c02847876b0793a0b4524f7c89146b5460e03000000, not going to resubmit.
[2024-06-30 12:40:22,096 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:563: PushActorTask failed because of network error, this task will be stashed away and waiting for Death info from GCS, task_id=c78d4c02847876b0793a0b4524f7c89146b5460e03000000, wait_queue_size=1
[2024-06-30 12:40:22,100 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: 793a0b4524f7c89146b5460e03000000, ip address: 172.16.120.98, port: 10033, worker_id: 9a5be31690f4add0091b918dfd889f4a53d3e79afeab5bd5a5d75041, raylet_id: 35e2c8771ce370fc4951a3be6ec96fe8d04278e12cfd062c7147b46a, num_restarts: 0, death context type=ActorDiedErrorContext
[2024-06-30 12:40:22,100 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:286: Failing pending tasks for actor 793a0b4524f7c89146b5460e03000000 because the actor is already dead.
[2024-06-30 12:40:22,100 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:893: Task failed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=tools.camera_actor, class_name=camActor, function_name=kill_actor, function_hash=}, task_id=c78d4c02847876b0793a0b4524f7c89146b5460e03000000, task_name=camActor.kill_actor, job_id=03000000, num_args=0, num_returns=1, depth=1, attempt_number=0, actor_task_spec={actor_id=793a0b4524f7c89146b5460e03000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff04000000, actor_counter=5}
[2024-06-30 12:40:22,100 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:36: Set max pending calls to 0 for actor e060ceb7f27e054cb481f32f03000000
[2024-06-30 12:40:22,101 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) actor_manager.cc:214: received notification on actor, state: ALIVE, actor_id: e060ceb7f27e054cb481f32f03000000, ip address: 172.16.120.98, port: 10030, worker_id: 7f67a1bdc42f6060c8b4dfd59ce90b3e9218ef79687d78baa90b71d3, raylet_id: 35e2c8771ce370fc4951a3be6ec96fe8d04278e12cfd062c7147b46a, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2024-06-30 12:40:22,101 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:237: Connecting to actor e060ceb7f27e054cb481f32f03000000 at worker 7f67a1bdc42f6060c8b4dfd59ce90b3e9218ef79687d78baa90b71d3
[2024-06-30 12:40:22,776 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:825: task 459f914ab5dce627e060ceb7f27e054cb481f32f03000000 retries left: 0, oom retries left: 0, task failed due to oom: 0
[2024-06-30 12:40:22,776 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:841: No retries left for task 459f914ab5dce627e060ceb7f27e054cb481f32f03000000, not going to resubmit.
[2024-06-30 12:40:22,776 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:563: PushActorTask failed because of network error, this task will be stashed away and waiting for Death info from GCS, task_id=459f914ab5dce627e060ceb7f27e054cb481f32f03000000, wait_queue_size=1
[2024-06-30 12:40:22,779 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: e060ceb7f27e054cb481f32f03000000, ip address: 172.16.120.98, port: 10030, worker_id: 7f67a1bdc42f6060c8b4dfd59ce90b3e9218ef79687d78baa90b71d3, raylet_id: 35e2c8771ce370fc4951a3be6ec96fe8d04278e12cfd062c7147b46a, num_restarts: 0, death context type=ActorDiedErrorContext
[2024-06-30 12:40:22,779 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:286: Failing pending tasks for actor e060ceb7f27e054cb481f32f03000000 because the actor is already dead.
[2024-06-30 12:40:22,779 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:893: Task failed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=tools.log_actor, class_name=logActor, function_name=kill_actor, function_hash=}, task_id=459f914ab5dce627e060ceb7f27e054cb481f32f03000000, task_name=logActor.kill_actor, job_id=03000000, num_args=0, num_returns=1, depth=1, attempt_number=0, actor_task_spec={actor_id=e060ceb7f27e054cb481f32f03000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff04000000, actor_counter=5}
[2024-06-30 12:40:22,779 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:36: Set max pending calls to 0 for actor af97e61451879ea91b41323003000000
data in posting main {'job_id': '1239', 'message': 'Job stop successful', 'code': <StatusCode.STOP_JOB_SUCCESS: 2002>, 'streaming_url': ''}
[2024-06-30-12:40:28] job stop success, posting status False
[2024-06-30 12:40:34,310 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:716: Disconnecting to the raylet.
[2024-06-30 12:40:34,310 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2024-06-30 12:40:34,310 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:639: Shutting down a core worker.
[2024-06-30 12:40:34,310 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_event_buffer.cc:197: Shutting down TaskEventBuffer.
[2024-06-30 12:40:34,310 I 2057 2123] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_event_buffer.cc:179: Task event buffer io service stopped.
[2024-06-30 12:40:34,310 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:665: Disconnecting a GCS client.
[2024-06-30 12:40:34,310 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:669: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2024-06-30 12:40:34,310 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:882: Core worker main io service stopped.
[2024-06-30 12:40:34,314 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:682: Core worker ready to be deallocated.
[2024-06-30 12:40:34,314 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:630: Core worker is destructed
[2024-06-30 12:40:34,314 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_event_buffer.cc:197: Shutting down TaskEventBuffer.
[2024-06-30 12:40:34,315 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker_process.cc:148: Destructing CoreWorkerProcessImpl. pid: 2057
[2024-06-30 12:40:34,315 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) io_service_pool.cc:47: IOServicePool is stopped.
[2024-06-30 12:40:34,394 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) stats.h:128: Stats module has shutdown.
Changed directory to: ./_ray_pkg_7bb999d73fe97c54
And here’s the raylet.err log:
[2024-06-30 12:44:12,815 C 260 260] (raylet) core_worker_client_pool.cc:32: Check failed: addr_proto.worker_id() != ""
*** StackTrace Information ***
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x56187a) [0x56388038b87a] ray::operator<<()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x563262) [0x56388038d262] ray::SpdLogMessage::Flush()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x563577) [0x56388038d577] ray::RayLog::~RayLog()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x48f204) [0x5638802b9204] ray::rpc::CoreWorkerClientPool::GetOrConnect()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x268e70) [0x563880092e70] std::_Function_handler<>::_M_invoke()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x496810) [0x5638802c0810] ray::pubsub::Subscriber::SendCommandBatchIfPossible()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x497af9) [0x5638802c1af9] ray::pubsub::Subscriber::SubscribeInternal()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x497f7a) [0x5638802c1f7a] ray::pubsub::Subscriber::Subscribe()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x33af88) [0x563880164f88] ray::OwnershipBasedObjectDirectory::SubscribeObjectLocations()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x308c90) [0x563880132c90] ray::ObjectManager::Pull()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x265e55) [0x56388008fe55] ray::raylet::DependencyManager::StartOrUpdateGetRequest()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x2ac546) [0x5638800d6546] ray::raylet::NodeManager::ProcessFetchOrReconstructMessage()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x2bbfa2) [0x5638800e5fa2] ray::raylet::NodeManager::ProcessClientMessage()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x203641) [0x56388002d641] std::_Function_handler<>::_M_invoke()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x50075d) [0x56388032a75d] ray::ClientConnection::ProcessMessage()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x541336) [0x56388036b336] EventTracker::RecordExecution()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x4f7ab2) [0x563880321ab2] boost::asio::detail::binder2<>::operator()()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x4f8208) [0x563880322208] boost::asio::detail::reactive_socket_recv_op<>::do_complete()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac4c1b) [0x5638808eec1b] boost::asio::detail::scheduler::do_run_one()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac71a9) [0x5638808f11a9] boost::asio::detail::scheduler::run()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac7662) [0x5638808f1662] boost::asio::io_context::run()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x17350a) [0x56387ff9d50a] main
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fa59bd3b083] __libc_start_main
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x1bcb17) [0x56387ffe6b17]
Could anyone experienced with Ray clusters help me out with this issue? Any help is greatly appreciated!! Thanks a lot in advance!