Hello.
I have a Python package which requires HPC resources to run certain computations. Frustrated by hitting walltime limits when running on a single node with multiprocessing.Pool, I turned to Ray to distribute the computation. Because I already used multiprocessing.Pool in some places, I conditionally replaced it with Ray’s Pool API and added some @ray.remote, ray.put and ray.get calls until I got something that seemed to run.
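The conditional swap described above can be sketched roughly as follows. This is a simplified illustration, not the package's actual code: `square`, `make_pool`, `run` and the `use_ray` flag are hypothetical names, and the Ray branch assumes Ray is installed and imports it only when requested.

```python
import multiprocessing


def square(x):
    """Stand-in for the real per-item computation (hypothetical)."""
    return x * x


def make_pool(ncpus, use_ray=False):
    """Return a Pool-like object backed by Ray or by the standard library."""
    if use_ray:
        # Ray ships a drop-in replacement for multiprocessing.Pool.
        # Imported lazily so the stdlib path works without Ray installed.
        from ray.util.multiprocessing import Pool
        return Pool(processes=ncpus)
    return multiprocessing.Pool(processes=ncpus)


def run(values, ncpus=2, use_ray=False):
    """Map `square` over `values` with whichever backend was selected."""
    pool = make_pool(ncpus, use_ray=use_ray)
    try:
        return pool.map(square, values)
    finally:
        # Both pool implementations provide close() and join().
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(run(range(5)))
```

Because both pools expose the same `map` interface, the calling code stays the same and only the constructor changes.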
However, I now run into LocalRayletDiedError intermittently during my simulations. In particular, I’m using the ray branch of that package with the following kind of PBS script:
#!/bin/bash
#PBS -P xd2
#PBS -q normalbw
#PBS -l walltime=24:00:00
#PBS -l ncpus=168
#PBS -l mem=1536GB
#PBS -l jobfs=800GB
#PBS -l storage=scratch/xd2+gdata/xd2
#PBS -l wd
#PBS -o simple3d_unsteady_prescr_dc0.log
#PBS -e simple3d_unsteady_prescr_dc0.err
#PBS -N simple3d_unsteady_prescr_dc0
module purge
module load python3/3.11.7 python3-as-python
ulimit -c unlimited
python3 -m pytest tests/test_simple_shear_3d.py \
--outdir=out -v --runslow --ncpus=$PBS_NCPUS -k="direction_change[0]"
The [0] can also be [1], [2.5], or [inf], each of which results in different stderr/stdout. Sometimes stderr is silent except for “exit code 1”, sometimes it gives a Ray traceback/Python dump. Stdout sometimes mentions the LocalRayletDiedError, and sometimes does not. Scaling down the (memory footprint of the) simulation by changing some internal parameters seems to stop these errors from appearing.
Example .err file:
Loading python3/3.11.7
Loading requirement: intel-mkl/2023.2.0
(PoolActor pid=1172855) [symbolize_elf.inc : 1311] RAW: /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so (deleted): open failed: errno=2
(misorientation_index pid=1183359) [2024-04-07 08:13:20,745 E 1183359 1185427] gcs_rpc_client.h:554: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate. [repeated 68x across cluster]
(raylet) [2024-04-07 08:13:23,291 E 1165290 1165290] (raylet) client_connection.cc:370: Broken Pipe happened during calling ServerConnection::DoAsyncWrites. [repeated 6x across cluster]
(raylet) [2024-04-07 08:13:23,227 E 1165290 1165290] (raylet) worker_pool.cc:550: Some workers of the worker process(1189719) have not registered within the timeout. The process is still alive, probably it's hanging during start. [repeated 3x across cluster]
(PoolActor pid=1172967) [repeated 15x across cluster]
(PoolActor pid=1172967) [2024-04-07 08:13:34,867 C 1172967 1172967] core_worker.cc:2818: Check failed: _s.ok() Bad status: IOError: Broken pipe [repeated 4x across cluster]
(PoolActor pid=1172967) *** StackTrace Information *** [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xfe373a) [0x14abf980373a] ray::operator<<() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xfe4ff7) [0x14abf9804ff7] ray::SpdLogMessage::Flush() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x14abf9805497] ray::RayLog::~RayLog() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationERKSt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDES5_INS_9RayObjectEEESaISQ_EEST_PS7_IS8_ISN_bESaISU_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSs+0x7bd) [0x14abf8fadc3d] ray::core::CoreWorker::ExecuteTask() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xa1e07e) [0x14abf923e07e] std::_Function_handler<>::_M_invoke() [repeated 16x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x7cc50e) [0x14abf8fec50e] ray::core::InboundRequest::Accept() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core20ActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE+0x114) [0x14abf8fed524] ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x7d014b) [0x14abf8ff014b] ray::core::ActorSchedulingQueue::ScheduleRequests() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core20ActorSchedulingQueue3AddEllSt8functionIFvS2_IFvNS_6StatusES2_IFvvEES5_EEEES2_IFvRKS3_S7_EES7_RKSsRKSt10shared_ptrINS_27FunctionDescriptorInterfaceEENS_6TaskIDERKSt6vectorINS_3rpc15ObjectReferenceESaISO_EE+0x400) [0x14abf8ff1c60] ray::core::ActorSchedulingQueue::Add() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core28CoreWorkerDirectTaskReceiver10HandleTaskERKNS_3rpc15PushTaskRequestEPNS2_13PushTaskReplyESt8functionIFvNS_6StatusES8_IFvvEESB_EE+0x119c) [0x14abf8fd2e6c] ray::core::CoreWorkerDirectTaskReceiver::HandleTask() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xa24c8e) [0x14abf9244c8e] EventTracker::RecordExecution() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xa1e4f6) [0x14abf923e4f6] boost::asio::detail::completion_handler<>::do_complete() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x10cd0eb) [0x14abf98ed0eb] boost::asio::detail::scheduler::do_run_one() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x10cea69) [0x14abf98eea69] boost::asio::detail::scheduler::run() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x10cf172) [0x14abf98ef172] boost::asio::io_context::run() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0xcd) [0x14abf8f7234d] ray::core::CoreWorker::RunTaskExecutionLoop() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x8c) [0x14abf8fb56cc] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x14abf8fb587d] ray::core::CoreWorkerProcess::RunTaskExecutionLoop() [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x205d0b) [0x14ac09e1cd0b] method_vectorcall_NOARGS [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(PyObject_Vectorcall+0x33) [0x14ac09e16a33] PyObject_Vectorcall [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x686) [0x14ac09e9e2a6] _PyEval_EvalFrameDefault [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x285d24) [0x14ac09e9cd24] _PyEval_Vector [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(PyEval_EvalCode+0xa4) [0x14ac09e9cae4] PyEval_EvalCode [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x32baed) [0x14ac09f42aed] run_eval_code_obj [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x32ba7a) [0x14ac09f42a7a] run_mod [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x32c031) [0x14ac09f43031] pyrun_file [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(_PyRun_SimpleFileObject+0x1a2) [0x14ac09f42d22] _PyRun_SimpleFileObject [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(_PyRun_AnyFileObject+0x44) [0x14ac09f42b64] _PyRun_AnyFileObject [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(Py_RunMain+0x2c6) [0x14ac09f49226] Py_RunMain [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(Py_BytesMain+0x29) [0x14ac09f48e49] Py_BytesMain [repeated 4x across cluster]
(PoolActor pid=1172967) /lib64/libc.so.6(__libc_start_main+0xe5) [0x14ac08ee2d85] __libc_start_main [repeated 4x across cluster]
(PoolActor pid=1172967) ray::PoolActor(_start+0x2e) [0x40069e] _start [repeated 4x across cluster]
(PoolActor pid=1169760) [symbolize_elf.inc : 1311] RAW: /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so (deleted): open failed: errno=2 [repeated 46x across cluster]
Example log file (stdout) (too large for inline inclusion)
Excerpt showing the job details summary:
======================================================================================
                  Resource Usage on 2024-04-07 08:13:57:
   Job Id:             113048560.gadi-pbs
   Project:            xd2
   Exit Status:        1
   Service Units:      198.52
   NCPUs Requested:    168                    NCPUs Used: 168
                                           CPU Time Used: 24:08:24
   Memory Requested:   1.5TB                 Memory Used: 236.85GB
   Walltime requested: 24:00:00            Walltime Used: 00:55:50
   JobFS requested:    800.0GB                JobFS used: 62.21GB
======================================================================================
I deliberately requested excessive resources to make sure I’m not hitting the PBS job limits. Even so, the logs mention SIGSEGV due to high memory usage, as well as large numbers of workers, although the reported memory usage comes nowhere near the requested amount. In any case, running with half the resources (half as many CPUs and half the memory) results in the same failures.
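As a rough sanity check, the figures in the job summary can be turned into utilisation fractions. The sketch below is only arithmetic on the numbers PBS reported, assuming the 1.5TB shown corresponds to the 1536GB requested in the script; `hms_to_seconds` is a helper name made up for illustration.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


ncpus = 168
cpu_time = hms_to_seconds("24:08:24")   # CPU Time Used
walltime = hms_to_seconds("00:55:50")   # Walltime Used

# Fraction of the available CPU-seconds that was actually consumed.
cpu_utilisation = cpu_time / (walltime * ncpus)

# Fraction of the requested memory that was reported as used (in GB).
mem_fraction = 236.85 / 1536

print(f"CPU utilisation: {cpu_utilisation:.1%}")
print(f"Memory used:     {mem_fraction:.1%}")
```

Both values come out at roughly 15%. These are averages over the whole job, so they say nothing about short-lived peaks or about how the load was spread across individual nodes.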
I wonder if anyone more experienced with Ray has some insight to offer here.
Cheers