Ray Cluster crash

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am using a Ray cluster to process over 1.4 TB of data. 99% of the data has been processed successfully, but the remaining 1% hits an error that puts the worker nodes in the cluster into a dead state. I have traced the issue to text that is too long (processing has a limit of 100 characters).
(raylet, ip=10.31.12.158) [2024-08-12 13:34:06,725 C 418015 418015] (raylet) local_resource_manager.cc:108: Check failed: (left >= right) 1608775913470000 vs 1613555873090000 [repeated 6x across cluster]
(raylet, ip=10.31.12.158) *** StackTrace Information *** [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb8124a) [0x56475e4cb24a] ray::operator<<() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb82a07) [0x56475e4cca07] ray::SpdLogMessage::Flush() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb82ea7) [0x56475e4ccea7] ray::RayLog::~RayLog() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x445635) [0x56475dd8f635] ray::LocalResourceManager::FreeTaskResourceInstances() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x44572c) [0x56475dd8f72c] ray::LocalResourceManager::ReleaseWorkerResources() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x3666b7) [0x56475dcb06b7] ray::raylet::LocalTaskManager::ReleaseWorkerResources() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2ed967) [0x56475dc37967] ray::raylet::NodeManager::DisconnectClient() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2f2388) [0x56475dc3c388] ray::raylet::NodeManager::ProcessDisconnectClientMessage() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2f25eb) [0x56475dc3c5eb] ray::raylet::NodeManager::ProcessClientMessage() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x30e1d1) [0x56475dc581d1] std::_Function_handler<>::_M_invoke() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5a239d) [0x56475deec39d] ray::ClientConnection::ProcessMessage() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5b1fae) [0x56475defbfae] EventTracker::RecordExecution() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) [repeated 20x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5997c8) [0x56475dee37c8] boost::asio::detail::reactive_socket_recv_op<>::do_complete() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc62acb) [0x56475e5acacb] boost::asio::detail::scheduler::do_run_one() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc65059) [0x56475e5af059] boost::asio::detail::scheduler::run() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc65572) [0x56475e5af572] boost::asio::io_context::run() [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1ce3c2) [0x56475db183c2] main [repeated 6x across cluster]
(raylet, ip=10.31.12.158) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f9479478083] __libc_start_main [repeated 6x across cluster]
(autoscaler +33m21s) Warning: The following resource request cannot be scheduled right now: {'node': 1.0, 'CPU': 1.0, 'memory': 2147483648.0, 'object_store_memory': 1073741824.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb82a07) [0x56475e4cca07] ray::SpdLogMessage::Flush()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb82ea7) [0x56475e4ccea7] ray::RayLog::~RayLog()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x445635) [0x56475dd8f635] ray::LocalResourceManager::FreeTaskResourceInstances()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x44572c) [0x56475dd8f72c] ray::LocalResourceManager::ReleaseWorkerResources()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x3666b7) [0x56475dcb06b7] ray::raylet::LocalTaskManager::ReleaseWorkerResources()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2ed967) [0x56475dc37967] ray::raylet::NodeManager::DisconnectClient()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2f2388) [0x56475dc3c388] ray::raylet::NodeManager::ProcessDisconnectClientMessage()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2f25eb) [0x56475dc3c5eb] ray::raylet::NodeManager::ProcessClientMessage()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x30e1d1) [0x56475dc581d1] std::_Function_handler<>::_M_invoke()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5a239d) [0x56475deec39d] ray::ClientConnection::ProcessMessage()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5b1fae) [0x56475defbfae] EventTracker::RecordExecution()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x599072) [0x56475dee3072] boost::asio::detail::binder2<>::operator()()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5997c8) [0x56475dee37c8] boost::asio::detail::reactive_socket_recv_op<>::do_complete()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc62acb) [0x56475e5acacb] boost::asio::detail::scheduler::do_run_one()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc65059) [0x56475e5af059] boost::asio::detail::scheduler::run()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc65572) [0x56475e5af572] boost::asio::io_context::run()
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1ce3c2) [0x56475db183c2] main
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f9479478083] __libc_start_main
/home/haodonglin/workspace/env_ray/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x222a17) [0x56475db6ca17]
[repeated 5x across cluster]
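
One way to keep the overlong 1% from reaching the workers at all is to split the records by length before processing. This is only a minimal sketch, not a fix for the raylet check failure itself: the `text` field name and the helper names are hypothetical, and the 100-character limit is the one quoted in the post above.

```python
from typing import Iterable, List, Tuple

MAX_TEXT_LEN = 100  # limit quoted in the post; adjust to the real processing limit


def within_limit(record: dict, max_len: int = MAX_TEXT_LEN) -> bool:
    """Return True when the record's text fits the processing limit."""
    # "text" is a hypothetical field name; use whatever column holds the string.
    return len(record.get("text", "")) <= max_len


def split_by_length(
    records: Iterable[dict], max_len: int = MAX_TEXT_LEN
) -> Tuple[List[dict], List[dict]]:
    """Separate processable records from overlong ones instead of failing on them."""
    ok, too_long = [], []
    for record in records:
        (ok if within_limit(record, max_len) else too_long).append(record)
    return ok, too_long


rows = [{"text": "short"}, {"text": "x" * 101}, {"text": "y" * 100}]
ok, too_long = split_by_length(rows)
print(len(ok), len(too_long))
```

If the job runs on Ray Data, the same row predicate could probably be passed to `Dataset.filter`, with the rejected rows written out separately for later inspection.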

Are you using Ray Data, and have you set up checkpointing? If so, you can resume where you left off after fixing the long-string problem.
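
If no checkpointing is in place, a rough sketch of the "resume where you left off" idea is to keep a manifest of inputs that already finished and skip them on the next run. This is not Ray Data's own checkpointing mechanism; `run_with_resume`, `load_done`, and the JSON manifest file are all hypothetical names used for illustration, and it assumes the job can be broken into independently processable inputs such as files.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def load_done(manifest: Path) -> set:
    """Read the set of inputs already processed; empty if no manifest exists yet."""
    if not manifest.exists():
        return set()
    return set(json.loads(manifest.read_text()))


def run_with_resume(
    inputs: Iterable[str], process: Callable[[str], None], manifest: Path
) -> List[str]:
    """Process only inputs missing from the manifest, recording each success."""
    done = load_done(manifest)
    newly_done = []
    for item in inputs:
        if item in done:
            continue  # finished in an earlier run
        process(item)  # may raise; the manifest still reflects prior successes
        done.add(item)
        newly_done.append(item)
        # Rewrite the manifest after every success so a crash loses at most one item.
        manifest.write_text(json.dumps(sorted(done)))
    return newly_done
```

After the long-string problem is fixed, rerunning with the same manifest path would only touch the inputs that never completed.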