I’ve been getting errors like this for about a week now and haven’t been able to nail down the cause. The only thing they have in common is that they happen after a few hours of inactivity: I run some things with Ray, run more a few hours later, and then get these types of errors. Any idea what the issue could be from these logs? Any suggestions on how to track it down?
Some information on the architecture of the project: I have a multiplexed Ray Serve deployment that serves a model. The processing happens with Ray Data (ray.data). Everything is running locally.
Things I’ve tried:
- limiting resources / CPUs / memory
- increasing memory on the cluster
- restructuring the project
- reconfiguring the autoscaling for the model
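For context on the autoscaling item, here is a minimal sketch of the kind of settings involved. The key names are those of Ray Serve's `autoscaling_config`; the values and the two helper functions are hypothetical placeholders, not the configuration used in this project. The log below shows the deployment scaling from 0 to 1 replicas just before the raylet check fails, so comparing a scale-to-zero config against a keep-warm one may be a useful experiment, though that link is only a guess.

```python
# Hypothetical values; only the key names come from Ray Serve's autoscaling_config.
autoscaling_config = {
    "min_replicas": 0,             # allows the deployment to scale to zero when idle
    "max_replicas": 2,
    "target_ongoing_requests": 1,
    "downscale_delay_s": 600,
}


def scales_to_zero(config: dict) -> bool:
    """Return True if the deployment may drop to zero replicas when idle."""
    # Ray Serve defaults min_replicas to 1 when the key is absent.
    return config.get("min_replicas", 1) == 0


def keep_warm(config: dict) -> dict:
    """Return a copy of the config that always keeps at least one replica."""
    warm = dict(config)
    warm["min_replicas"] = max(1, warm.get("min_replicas", 1))
    # max_replicas must not fall below min_replicas.
    warm["max_replicas"] = max(warm["min_replicas"], warm.get("max_replicas", 1))
    return warm


print(scales_to_zero(autoscaling_config))
print(scales_to_zero(keep_warm(autoscaling_config)))
```

A dict like this would be passed as `@serve.deployment(autoscaling_config=...)`. If the crash stops appearing with the keep-warm variant, that would point at the cold start after idle rather than at memory pressure.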
System:
WSL - Ubuntu
Ray 2.40.0
(run_pipeline pid=22219) 23:39:15.579 | INFO | prefect.task_runner.ray - Local Ray instance is already initialized. Using existing local instance.
(run_pipeline pid=22219) 23:39:15.677 | INFO | Task run 'run_pipeline-a36' - Setting memory to 30 GB
(run_pipeline pid=22219) 23:39:15.691 | INFO | Task run 'run_pipeline-a36' - Setting num_cpus to 30
(run_pipeline pid=22219) Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-05_15-39-53_332211_34574/logs/ray-data
(run_pipeline pid=22219) Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImages] -> TaskPoolMapOperator[Map(apply_transform)->MapBatches(model_response)->Map(write)]
(run_pipeline pid=22219) Truncating long operator name to 100 characters. To disable this behavior, set `ray.data.DataContext.get_current().DEFAULT_ENABLE_PROGRESS_BAR_NAME_TRUNCATION = False`.
(pid=22219) Running Dataset. Active & requested resources: 14/30 CPU, 7.2GB/8.3GB object store
(Map(apply_transform)->MapBatches(model_response)->Map(exr) pid=22501) Starting model call for ['/mnt/q/users/jreeves/RayServeProject_test/sample_in/test_img.1009.jpg'
(Map(apply_transform)->MapBatches(model_response)->Map(exr) pid=22501) '/mnt/q/users/jreeves/RayServeProject_test/sample_in/test_img.1009.jpg']
(ServeController pid=35265) INFO 2024-12-05 23:39:30,481 controller 35265 -- Upscaling Deployment(name='depthPredictor', app='app1') from 0 to 1 replicas. Current ongoing requests: 1.00, current running replicas: 0.
(ServeController pid=35265) INFO 2024-12-05 23:39:30,493 controller 35265 -- Adding 1 replica to Deployment(name='depthPredictor', app='app1').
(raylet) [2024-12-05 23:39:30,697 C 34914 34914] (raylet) worker_pool.cc:1403: Check failed: worker->GetAssignedJobId().IsNil() || worker->GetAssignedJobId() == pop_worker_request->job_id
(raylet) *** StackTrace Information ***
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc10eba) [0x55bbb5f54eba] ray::operator<<()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc132a1) [0x55bbb5f572a1] ray::RayLog::~RayLog()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x34e900) [0x55bbb5692900] ray::raylet::WorkerPool::PopWorker()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x3b68cc) [0x55bbb56fa8cc] ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x3b78f7) [0x55bbb56fb8f7] ray::raylet::LocalTaskManager::QueueAndScheduleTask()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a08d4) [0x55bbb57e48d4] ray::raylet::ClusterTaskManager::ScheduleOnNode()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a3cce) [0x55bbb57e7cce] ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a3904) [0x55bbb57e7904] ray::raylet::ClusterTaskManager::QueueAndScheduleTask()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x333a3e) [0x55bbb5677a3e] ray::raylet::NodeManager::HandleRequestWorkerLease()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x325507) [0x55bbb5669507] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x699dd8) [0x55bbb59dddd8] EventTracker::RecordExecution()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x694dce) [0x55bbb59d8dce] std::_Function_handler<>::_M_invoke()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x695246) [0x55bbb59d9246] boost::asio::detail::completion_handler<>::do_complete()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc8571b) [0x55bbb5fc971b] boost::asio::detail::scheduler::do_run_one()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc87ca9) [0x55bbb5fcbca9] boost::asio::detail::scheduler::run()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc881c2) [0x55bbb5fcc1c2] boost::asio::io_context::run()
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1ede65) [0x55bbb5531e65] main
(raylet) /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f3efb2ffd90]
(raylet) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f3efb2ffe40] __libc_start_main
(raylet) /home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x248507) [0x55bbb558c507]
(raylet)
(Map(apply_transform)->MapBatches(model_response)->Map(exr) pid=22582) '/mnt/q/users/jreeves/RayServeProject_test/sample_in/test_img.1009.jpg']
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x34e900) [0x55bbb5692900] ray::raylet::WorkerPool::PopWorker()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x34f060) [0x55bbb5693060] ray::raylet::WorkerPool::PopWorker()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x3b68cc) [0x55bbb56fa8cc] ray::raylet::LocalTaskManager::DispatchScheduledTasksToWorkers()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x3b78f7) [0x55bbb56fb8f7] ray::raylet::LocalTaskManager::QueueAndScheduleTask()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a08d4) [0x55bbb57e48d4] ray::raylet::ClusterTaskManager::ScheduleOnNode()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a3cce) [0x55bbb57e7cce] ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a3904) [0x55bbb57e7904] ray::raylet::ClusterTaskManager::QueueAndScheduleTask()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x333a3e) [0x55bbb5677a3e] ray::raylet::NodeManager::HandleRequestWorkerLease()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x325507) [0x55bbb5669507] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x699dd8) [0x55bbb59dddd8] EventTracker::RecordExecution()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x694dce) [0x55bbb59d8dce] std::_Function_handler<>::_M_invoke()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x695246) [0x55bbb59d9246] boost::asio::detail::completion_handler<>::do_complete()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc8571b) [0x55bbb5fc971b] boost::asio::detail::scheduler::do_run_one()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc87ca9) [0x55bbb5fcbca9] boost::asio::detail::scheduler::run()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc881c2) [0x55bbb5fcc1c2] boost::asio::io_context::run()
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1ede65) [0x55bbb5531e65] main
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f3efb2ffd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f3efb2ffe40] __libc_start_main
/home/user/dev/RayServeProject/.conda/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x248507) [0x55bbb558c507]
(raylet) The node with node id: db8c197ae32df0695eefb41d93f25c302bf11f51505e3bf389a4b028 and address: 100.10.4.333 and node name: 100.10.4.333 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
(Map(apply_transform)->MapBatches(model_response)->Map(exr) pid=22582) Starting model call for ['/mnt/q/users/jreeves/RayServeProject_test/sample_in/test_img.1009.jpg'
(pid=22219) Running Dataset. Active & requested resources: 14/30 CPU, 7.2GB/8.3GB object store: : 0.00 row [00:26, ? row/s]
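
Ray Data's progress bar can interleave with other log lines in console output, which makes the fatal message easy to miss. As one way to track the issue down, here is a minimal sketch that strips that progress text and pulls out the raylet `Check failed` line. The helper names (`strip_progress`, `find_fatal_checks`) are hypothetical, and the regex assumes the progress text has the same shape as in the output above.

```python
import re

# Progress-bar text that Ray Data interleaves with the real log lines.
# Assumes the wording seen in this console output; "object sto" is often truncated.
PROGRESS = re.compile(
    r"\(pid=\d+\) Running Dataset\. Active & requested resources: "
    r"\d+/\d+ CPU, [\d.]+GB/[\d.]+GB object sto(?:re)?\s*"
)


def strip_progress(line: str) -> str:
    """Remove every occurrence of the progress-bar text from one line."""
    return PROGRESS.sub("", line).strip()


def find_fatal_checks(lines):
    """Return cleaned lines that contain a raylet 'Check failed' message."""
    cleaned = (strip_progress(line) for line in lines)
    return [line for line in cleaned if "Check failed:" in line]


# Two lines copied from the console output in this post.
sample = [
    "(pid=22219) Running Dataset. Active & requested resources: 14/30 CPU, "
    "7.2GB/8.3GB object sto (raylet) *** StackTrace Information ***",
    "(pid=22219) Running Dataset. Active & requested resources: 14/30 CPU, "
    "7.2GB/8.3GB object sto (raylet) [2024-12-05 23:39:30,697 C 34914 34914] "
    "(raylet) worker_pool.cc:1403: Check failed: "
    "worker->GetAssignedJobId().IsNil() || "
    "worker->GetAssignedJobId() == pop_worker_request->job_id",
]

for hit in find_fatal_checks(sample):
    print(hit)
```

Running the same filter over a saved copy of the console output, or over the raylet logs under the session's `logs` directory, should give the timestamp of each failed check, which can then be lined up against the Serve controller's scaling messages.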