Raylet error: Check failed: addr_proto.worker_id() != ""

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi!
I have a software pipeline that runs on a Ray cluster. I submit jobs to the cluster through a script (roughly as in the sketch below), and recently I started getting an error whenever I try to stop a job.
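For context, the submission side of my script looks roughly like the following. This is only a simplified stand-in using Ray's Jobs API (`JobSubmissionClient`), not my exact code; the entrypoint and `working_dir` are placeholders, and the dashboard address is the one that appears in the logs further down.

```python
from ray.job_submission import JobSubmissionClient

# Dashboard address of the head node (same address shown in the logs below).
client = JobSubmissionClient("http://172.16.120.98:8265")

# Submit the pipeline job; entrypoint and working_dir are simplified placeholders.
submission_id = client.submit_job(
    entrypoint="python main.py",
    runtime_env={"working_dir": "."},
)

# Stopping the job is the step where the raylet crash shown below occurs.
client.stop_job(submission_id)
```

Stopping via the CLI (`ray job stop <submission_id> --address http://172.16.120.98:8265`) goes through the same API.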

Below is the Ray job log when the job is stopped successfully:

total 72
drwxr-xr-x 1 root root  4096 Jun 30 12:38 .
drwxr-xr-x 1 root root  4096 Apr  3 17:42 ..
-rwxrwxr-x 1 root root  9186 Apr  3 17:09 cargo_config.py
drwxr-xr-x 4 root root  4096 Jun 30 12:38 data
drwxrwxr-x 4 root root  4096 Nov 16  2023 engines
-rw-rw-r-- 1 root root 20855 Apr  3 17:09 main.py
drwxrwxr-x 4 root root  4096 Nov 16  2023 model_data
-rwxr-xr-x 1 root root   729 Jun 30 12:38 run.sh
drwxrwxr-x 2 root root  4096 Apr  3 17:09 tools
drwxrwxr-x 2 root root  4096 Apr  3 17:09 tracker
{"job_id": "1239", "op_type": "stop", "cam_streaming_url": "rtsp://admin:Welcome234@10.60.62.83:1548/box_cam.ts", "survey_operation_type": "4", "server_ip": "http://172.16.120.62:3002"}
/tmp/ray/session_latest/runtime_resources/working_dir_files/_ray_pkg_7bb999d73fe97c54
2024-06-30 12:40:21,577	INFO worker.py:1313 -- Using address 172.16.120.98:1234 set in the environment variable RAY_ADDRESS
2024-06-30 12:40:21,577	WARNING worker.py:1396 -- Both RAY_JOB_CONFIG_JSON_ENV_VAR and ray.init(runtime_env) are provided, only using JSON_ENV_VAR to construct job_config. Please ensure no runtime_env is used in driver script's ray.init() when using job submission API.
2024-06-30 12:40:21,577	INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 172.16.120.98:1234...
2024-06-30 12:40:21,583	INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at http://172.16.120.98:8265
[2024-06-30 12:40:21,584 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 2057
[2024-06-30 12:40:21,585 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-06-30 12:40:21,586 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) grpc_server.cc:129: driver server started, listening on port 10045.
[2024-06-30 12:40:21,588 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:225: Initializing worker at address: 172.16.120.98:10045, worker ID 04000000ffffffffffffffffffffffffffffffffffffffffffffffff, raylet 35e2c8771ce370fc4951a3be6ec96fe8d04278e12cfd062c7147b46a
[2024-06-30 12:40:21,588 I 2057 2057] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_event_buffer.cc:186: Reporting task events to GCS every 1000ms.
[2024-06-30 12:40:21,589 I 2057 2117] (python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff) core_worker.cc:568: Event stats:

Global stats: 7 total (4 active)
Queueing time: mean = 1.768 us, max = 6.143 us, min = 3.012 us, total = 12.378 us
Execution time:  mean = 8.837 us, total = 61.857 us
Event stats:
	PeriodicalRunner.RunFnPeriodically - 2 total (1 active, 1 running), CPU time: mean = 2.244 us, total = 4.487 us
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 12.376 us, total = 12.376 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 44.994 us, total = 44.994 us
	UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

And this is the log when it fails:

[2024-06-30 12:44:12,759 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:237: Connecting to actor af97e61451879ea91b41323003000000 at worker 6b46c89f7e155c506f74a0dfadc32b7b6a38b08d9c92290083e8401a
[2024-06-30 12:44:12,760 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) actor_manager.cc:214: received notification on actor, state: ALIVE, actor_id: 2de6bbed4482d002071a15c205000000, ip address: 172.16.120.98, port: 10053, worker_id: 1288e554cf5b193db2ab5f2804861973bac53f9bf463596963f6fbe3, raylet_id: 35e2c8771ce370fc4951a3be6ec96fe8d04278e12cfd062c7147b46a, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2024-06-30 12:44:12,760 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:237: Connecting to actor 2de6bbed4482d002071a15c205000000 at worker 1288e554cf5b193db2ab5f2804861973bac53f9bf463596963f6fbe3
[2024-06-30 12:44:12,792 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:825: task 387079a42b5466352de6bbed4482d002071a15c205000000 retries left: 0, oom retries left: 0, task failed due to oom: 0
[2024-06-30 12:44:12,792 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) task_manager.cc:841: No retries left for task 387079a42b5466352de6bbed4482d002071a15c205000000, not going to resubmit.
[2024-06-30 12:44:12,792 I 3019 3044] (python-core-driver-06000000ffffffffffffffffffffffffffffffffffffffffffffffff) direct_actor_task_submitter.cc:563: PushActorTask failed because of network error, this task will be stashed away and waiting for Death info from GCS, task_id=387079a42b5466352de6bbed4482d002071a15c205000000, wait_queue_size=1
(raylet) [2024-06-30 12:44:12,815 C 260 260] (raylet) core_worker_client_pool.cc:32:  Check failed: addr_proto.worker_id() != ""
(raylet) *** StackTrace Information ***
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x56187a) [0x56388038b87a] ray::operator<<()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x563262) [0x56388038d262] ray::SpdLogMessage::Flush()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x563577) [0x56388038d577] ray::RayLog::~RayLog()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x48f204) [0x5638802b9204] ray::rpc::CoreWorkerClientPool::GetOrConnect()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x268e70) [0x563880092e70] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x496810) [0x5638802c0810] ray::pubsub::Subscriber::SendCommandBatchIfPossible()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x497af9) [0x5638802c1af9] ray::pubsub::Subscriber::SubscribeInternal()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x497f7a) [0x5638802c1f7a] ray::pubsub::Subscriber::Subscribe()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x33af88) [0x563880164f88] ray::OwnershipBasedObjectDirectory::SubscribeObjectLocations()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x308c90) [0x563880132c90] ray::ObjectManager::Pull()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x265e55) [0x56388008fe55] ray::raylet::DependencyManager::StartOrUpdateGetRequest()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x2ac546) [0x5638800d6546] ray::raylet::NodeManager::ProcessFetchOrReconstructMessage()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x2bbfa2) [0x5638800e5fa2] ray::raylet::NodeManager::ProcessClientMessage()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x203641) [0x56388002d641] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x50075d) [0x56388032a75d] ray::ClientConnection::ProcessMessage()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x541336) [0x56388036b336] EventTracker::RecordExecution()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x4f7ab2) [0x563880321ab2] boost::asio::detail::binder2<>::operator()()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x4f8208) [0x563880322208] boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac4c1b) [0x5638808eec1b] boost::asio::detail::scheduler::do_run_one()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac71a9) [0x5638808f11a9] boost::asio::detail::scheduler::run()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac7662) [0x5638808f1662] boost::asio::io_context::run()
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x17350a) [0x56387ff9d50a] main
(raylet) /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fa59bd3b083] __libc_start_main
(raylet) /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x1bcb17) [0x56387ffe6b17]

And this is the raylet.err log:

[2024-06-30 12:44:12,815 C 260 260] (raylet) core_worker_client_pool.cc:32:  Check failed: addr_proto.worker_id() != ""
*** StackTrace Information ***
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x56187a) [0x56388038b87a] ray::operator<<()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x563262) [0x56388038d262] ray::SpdLogMessage::Flush()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x563577) [0x56388038d577] ray::RayLog::~RayLog()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x48f204) [0x5638802b9204] ray::rpc::CoreWorkerClientPool::GetOrConnect()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x268e70) [0x563880092e70] std::_Function_handler<>::_M_invoke()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x496810) [0x5638802c0810] ray::pubsub::Subscriber::SendCommandBatchIfPossible()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x497af9) [0x5638802c1af9] ray::pubsub::Subscriber::SubscribeInternal()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x497f7a) [0x5638802c1f7a] ray::pubsub::Subscriber::Subscribe()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x33af88) [0x563880164f88] ray::OwnershipBasedObjectDirectory::SubscribeObjectLocations()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x308c90) [0x563880132c90] ray::ObjectManager::Pull()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x265e55) [0x56388008fe55] ray::raylet::DependencyManager::StartOrUpdateGetRequest()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x2ac546) [0x5638800d6546] ray::raylet::NodeManager::ProcessFetchOrReconstructMessage()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x2bbfa2) [0x5638800e5fa2] ray::raylet::NodeManager::ProcessClientMessage()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x203641) [0x56388002d641] std::_Function_handler<>::_M_invoke()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x50075d) [0x56388032a75d] ray::ClientConnection::ProcessMessage()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x541336) [0x56388036b336] EventTracker::RecordExecution()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x4f7ab2) [0x563880321ab2] boost::asio::detail::binder2<>::operator()()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x4f8208) [0x563880322208] boost::asio::detail::reactive_socket_recv_op<>::do_complete()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac4c1b) [0x5638808eec1b] boost::asio::detail::scheduler::do_run_one()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac71a9) [0x5638808f11a9] boost::asio::detail::scheduler::run()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0xac7662) [0x5638808f1662] boost::asio::io_context::run()
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x17350a) [0x56387ff9d50a] main
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fa59bd3b083] __libc_start_main
/usr/local/lib/python3.8/dist-packages/ray/core/src/ray/raylet/raylet(+0x1bcb17) [0x56387ffe6b17]

Could anyone experienced with Ray clusters please take a look at this issue? Any help is greatly appreciated. Thanks!