Distributed `multiprocessing.Pool` resulting in `LocalRayletDiedError` or silent SIGSEGV termination

Hello.

I have a Python package which requires HPC resources to run certain computations. Frustrated by hitting walltime limits when running on a single node with `multiprocessing.Pool`, I turned to Ray to distribute the computation. Because I already had `multiprocessing.Pool` in some places, I conditionally replaced it with Ray's `Pool` API, additionally adding some `@ray.remote`, `ray.put` and `ray.get` calls until I got something that seemed to run.

However, I now run into `LocalRayletDiedError` intermittently during my simulations. In particular, I'm using the `ray` branch of that package with the following kind of PBS script:

#!/bin/bash
#PBS -P xd2
#PBS -q normalbw
#PBS -l walltime=24:00:00
#PBS -l ncpus=168
#PBS -l mem=1536GB
#PBS -l jobfs=800GB
#PBS -l storage=scratch/xd2+gdata/xd2
#PBS -l wd
#PBS -o simple3d_unsteady_prescr_dc0.log
#PBS -e simple3d_unsteady_prescr_dc0.err
#PBS -N simple3d_unsteady_prescr_dc0

module purge
module load python3/3.11.7 python3-as-python

ulimit -c unlimited

python3 -m pytest tests/test_simple_shear_3d.py \
    --outdir=out -v --runslow --ncpus=$PBS_NCPUS -k="direction_change[0]"

The `[0]` can also be `[1]`, `[2.5]`, or `[inf]`, each of which produces different stderr/stdout. Sometimes stderr is silent except for "exit code 1", and sometimes it contains a Ray traceback or Python crash dump. Stdout sometimes mentions the `LocalRayletDiedError` and sometimes does not. Scaling down the memory footprint of the simulation by changing some internal parameters seems to make these errors disappear.

Example .err file:
Loading python3/3.11.7
  Loading requirement: intel-mkl/2023.2.0
(PoolActor pid=1172855) [symbolize_elf.inc : 1311] RAW: /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so (deleted): open failed: errno=2
[previous line repeated a further 22 times]
(misorientation_index pid=1183359) [2024-04-07 08:13:20,745 E 1183359 1185427] gcs_rpc_client.h:554: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate. [repeated 68x across cluster]
(raylet) [2024-04-07 08:13:23,291 E 1165290 1165290] (raylet) client_connection.cc:370: Broken Pipe happened during calling ServerConnection::DoAsyncWrites. [repeated 6x across cluster]
(raylet) [2024-04-07 08:13:23,227 E 1165290 1165290] (raylet) worker_pool.cc:550: Some workers of the worker process(1189719) have not registered within the timeout. The process is still alive, probably it's hanging during start. [repeated 3x across cluster]
(PoolActor pid=1172967)  [repeated 15x across cluster]
(PoolActor pid=1172967) [2024-04-07 08:13:34,867 C 1172967 1172967] core_worker.cc:2818:  Check failed: _s.ok() Bad status: IOError: Broken pipe [repeated 4x across cluster]
(PoolActor pid=1172967) *** StackTrace Information *** [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xfe373a) [0x14abf980373a] ray::operator<<() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xfe4ff7) [0x14abf9804ff7] ray::SpdLogMessage::Flush() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x14abf9805497] ray::RayLog::~RayLog() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationERKSt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDES5_INS_9RayObjectEEESaISQ_EEST_PS7_IS8_ISN_bESaISU_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSs+0x7bd) [0x14abf8fadc3d] ray::core::CoreWorker::ExecuteTask() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xa1e07e) [0x14abf923e07e] std::_Function_handler<>::_M_invoke() [repeated 16x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x7cc50e) [0x14abf8fec50e] ray::core::InboundRequest::Accept() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core20ActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE+0x114) [0x14abf8fed524] ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x7d014b) [0x14abf8ff014b] ray::core::ActorSchedulingQueue::ScheduleRequests() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core20ActorSchedulingQueue3AddEllSt8functionIFvS2_IFvNS_6StatusES2_IFvvEES5_EEEES2_IFvRKS3_S7_EES7_RKSsRKSt10shared_ptrINS_27FunctionDescriptorInterfaceEENS_6TaskIDERKSt6vectorINS_3rpc15ObjectReferenceESaISO_EE+0x400) [0x14abf8ff1c60] ray::core::ActorSchedulingQueue::Add() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core28CoreWorkerDirectTaskReceiver10HandleTaskERKNS_3rpc15PushTaskRequestEPNS2_13PushTaskReplyESt8functionIFvNS_6StatusES8_IFvvEESB_EE+0x119c) [0x14abf8fd2e6c] ray::core::CoreWorkerDirectTaskReceiver::HandleTask() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xa24c8e) [0x14abf9244c8e] EventTracker::RecordExecution() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0xa1e4f6) [0x14abf923e4f6] boost::asio::detail::completion_handler<>::do_complete() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x10cd0eb) [0x14abf98ed0eb] boost::asio::detail::scheduler::do_run_one() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x10cea69) [0x14abf98eea69] boost::asio::detail::scheduler::run() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(+0x10cf172) [0x14abf98ef172] boost::asio::io_context::run() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0xcd) [0x14abf8f7234d] ray::core::CoreWorker::RunTaskExecutionLoop() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x8c) [0x14abf8fb56cc] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop() [repeated 4x across cluster]
(PoolActor pid=1172967) /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x14abf8fb587d] ray::core::CoreWorkerProcess::RunTaskExecutionLoop() [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x205d0b) [0x14ac09e1cd0b] method_vectorcall_NOARGS [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(PyObject_Vectorcall+0x33) [0x14ac09e16a33] PyObject_Vectorcall [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x686) [0x14ac09e9e2a6] _PyEval_EvalFrameDefault [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x285d24) [0x14ac09e9cd24] _PyEval_Vector [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(PyEval_EvalCode+0xa4) [0x14ac09e9cae4] PyEval_EvalCode [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x32baed) [0x14ac09f42aed] run_eval_code_obj [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x32ba7a) [0x14ac09f42a7a] run_mod [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(+0x32c031) [0x14ac09f43031] pyrun_file [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(_PyRun_SimpleFileObject+0x1a2) [0x14ac09f42d22] _PyRun_SimpleFileObject [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(_PyRun_AnyFileObject+0x44) [0x14ac09f42b64] _PyRun_AnyFileObject [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(Py_RunMain+0x2c6) [0x14ac09f49226] Py_RunMain [repeated 4x across cluster]
(PoolActor pid=1172967) /apps/python3/3.11.7/lib/libpython3.11.so.1.0(Py_BytesMain+0x29) [0x14ac09f48e49] Py_BytesMain [repeated 4x across cluster]
(PoolActor pid=1172967) /lib64/libc.so.6(__libc_start_main+0xe5) [0x14ac08ee2d85] __libc_start_main [repeated 4x across cluster]
(PoolActor pid=1172967) ray::PoolActor(_start+0x2e) [0x40069e] _start [repeated 4x across cluster]
(PoolActor pid=1169760) [symbolize_elf.inc : 1311] RAW: /home/157/lb4583/.local/lib/python3.11/site-packages/ray/_raylet.so (deleted): open failed: errno=2 [repeated 46x across cluster]

The example log file (stdout) is too large to include inline; here is an excerpt showing the job details summary:

======================================================================================
                  Resource Usage on 2024-04-07 08:13:57:
   Job Id:             113048560.gadi-pbs
   Project:            xd2
   Exit Status:        1
   Service Units:      198.52
   NCPUs Requested:    168                    NCPUs Used: 168             
                                           CPU Time Used: 24:08:24        
   Memory Requested:   1.5TB                 Memory Used: 236.85GB        
   Walltime requested: 24:00:00            Walltime Used: 00:55:50        
   JobFS requested:    800.0GB                JobFS used: 62.21GB         
======================================================================================

I deliberately requested exorbitant resources to make sure I'm not hitting PBS job limits. Even so, the logs mention SIGSEGV due to high memory usage as well as large numbers of workers, although reported memory usage never comes close to the requested amount. In any case, running with half the resources (half as many CPUs and half the memory) produces the same failures.

I wonder if anyone more experienced with Ray has some insight to offer here.

Cheers

I believe this was caused by trying to deploy across multiple nodes without first setting up a Ray cluster: the 168 CPUs span several physical nodes, but `ray.init()` on its own only starts a single-node Ray instance.
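For anyone hitting the same issue, a minimal sketch of what the cluster setup might look like inside a PBS job script, before launching the workload. This assumes `$PBS_NODEFILE` lists the allocated hosts, that `ray` is on `PATH` on every node, that passwordless `ssh` between nodes is permitted, and that the script itself runs on the first allocated node; the port is a placeholder:

```shell
# Start a Ray head process on the node running this script.
PORT=6379
HEAD_NODE=$(sort -u "$PBS_NODEFILE" | head -n 1)
ray start --head --port=$PORT

# Start a Ray worker on every other allocated node.
for node in $(sort -u "$PBS_NODEFILE" | grep -v "$HEAD_NODE"); do
    ssh "$node" "ray start --address=$HEAD_NODE:$PORT" &
done
wait

# The application then connects with ray.init(address="auto")
# instead of starting its own single-node Ray instance.
python3 -m pytest tests/test_simple_shear_3d.py \
    --outdir=out -v --runslow --ncpus=$PBS_NCPUS -k="direction_change[0]"
```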