Hi (reposting since I can't find my original post in search and can't edit it),
I'd like help confirming that this is an application error (my issue) and not a Ray issue. For context: a Ray actor (no max_concurrency set) calls some code that uses the multiprocessing module. The failure happens during load testing, when we hit Ray Serve with a much higher volume of requests than it can handle.
Since the actor has no max_concurrency set, calls on it should be serialized (executed one at a time)? So we are not sure why the multiprocessing code is failing. Note that other Ray actors (including Ray Serve replicas) and tasks are running concurrently in this application.
Is it possible that the multiprocessing Pool and the Ray workers are contending for resources, and that the Pool is somehow starved of resources and exits?
2021-09-23 23:34:08,088 ERROR worker.py:428 -- SystemExit was raised from the worker
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 640, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 525, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 532, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 536, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 486, in ray._raylet.execute_task.function_executor
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/opt/my_classification/my_classification/yolov3_util.py", line 26, in predict
    return self.yolov3_detector.predict(image_path, threshold, display_image)
  ...
  ...
    return do_pooled_work(detect_qr_code_worker, pack_args(), num_pools, timeout)
  File "/opt/my_classification/my_classification/prediction/lib/qr_code.py", line 218, in do_pooled_work
    with Pool(num_pools) as pool:
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 76, in _launch
    os._exit(code)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 425, in sigterm_handler
    sys.exit(1)
SystemExit: 1
*** SIGSEGV received at time=1632440048 on cpu 10 ***
PC: @ 0x7fcc6a63c078 (unknown) pollset_shutdown()
@ 0x7fcc6bd5d980 (unknown) (unknown)
@ 0x7fcc6a618300 32 cq_shutdown_next()
@ 0x7fcc6a618732 176 grpc_completion_queue_shutdown
@ 0x7fcc69ff0e54 64 ray::gcs::ServiceBasedGcsClient::~ServiceBasedGcsClient()
@ 0x7fcc69fecbba 32 std::_Sp_counted_base<>::_M_release()
@ 0x7fcc6a0c7778 80 ray::core::CoreWorker::~CoreWorker()
@ 0x7fcc69fecbba 32 std::_Sp_counted_base<>::_M_release()
@ 0x7fcc6a0e944e 144 ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
@ 0x7fcc69f743d7 32 __pyx_pw_3ray_7_raylet_10CoreWorker_9run_task_loop()
@ 0x55f45df76791 (unknown) _PyMethodDef_RawFastCallKeywords
@ 0x7fcc69f743c0 (unknown) (unknown)
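One possible reading of the traceback above, offered as a hypothesis rather than a diagnosis: the last frames show sigterm_handler running at the os._exit(code) line of popen_fork._launch, which is code that only the forked child executes. A forked child inherits the parent's Python signal handlers, and the Pool context manager calls terminate() on exit, which sends SIGTERM to the pool workers on Unix. The sketch below uses only the standard library (no Ray) to show that inheritance. The names exit_on_sigterm, terminate_forked_child, and demo are hypothetical, and exit_on_sigterm is only a stand-in for whatever handler the parent installed.

```python
import os
import signal
import sys
import time


def exit_on_sigterm(signum, frame):
    # Hypothetical stand-in for a handler installed by the parent process.
    sys.exit(1)


def terminate_forked_child(reset_handler):
    """Fork a child, send it SIGTERM, and report how the child ended."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        code = 0
        try:
            os.close(ready_r)
            if reset_handler:
                # Restore the default action so SIGTERM kills the child directly.
                signal.signal(signal.SIGTERM, signal.SIG_DFL)
            os.write(ready_w, b"x")  # tell the parent the child is ready
            time.sleep(30)
        except SystemExit as exc:
            code = exc.code  # raised by the handler inherited from the parent
        finally:
            os._exit(code)
    # parent process
    os.close(ready_w)
    os.read(ready_r, 1)  # block until the child signals readiness
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("signal", os.WTERMSIG(status))
    return ("exit", os.WEXITSTATUS(status))


def demo():
    """Run both variants with a SIGTERM handler installed in the parent."""
    previous = signal.signal(signal.SIGTERM, exit_on_sigterm)
    try:
        inherited = terminate_forked_child(reset_handler=False)
        reset = terminate_forked_child(reset_handler=True)
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
    return inherited, reset
```

Calling demo() on a POSIX system should show the first child leaving through SystemExit raised by the inherited handler, while the second child, which reset the handler, is killed by the signal itself. Whether this is what actually happens inside the Ray worker is an assumption that would need confirming.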
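Here is the multiprocessing code:
# Imports are assumed; they were not shown in the original snippet.
import logging
from multiprocessing import Pool, TimeoutError  # next(timeout=...) raises this class, not the builtin

import cv2

logger = logging.getLogger(__name__)


def do_pooled_work(func, iterable, num_pools, timeout):
    """Return the first truthy result from the pool, or "" if there is none."""
    with Pool(num_pools) as pool:
        results = pool.imap_unordered(func, iterable)
        while True:
            try:
                r = results.next(timeout=timeout)
            except StopIteration:
                break
            except TimeoutError:
                print("TimeoutError")
                break
            except cv2.error as exc:
                logger.error(exc)
                break
            if r:
                # Leaving the with block (here or via break) terminates the pool.
                return r
    return ""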
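A side note on the timeout handling in do_pooled_work, as a small sketch: the iterator returned by imap_unordered raises multiprocessing.TimeoutError from next(timeout=...), and that class is not a subclass of the builtin TimeoutError. The snippet's imports are not shown, so whether its except clause catches the right class is an assumption either way. The example below uses ThreadPool (same API as Pool, but no forking) purely to keep the demonstration self-contained. The names slow_identity and first_result_or_timeout are hypothetical.

```python
import builtins
import multiprocessing
import time
from multiprocessing.pool import ThreadPool

# The two TimeoutError classes are unrelated, so catching one misses the other.
MP_TIMEOUT_IS_BUILTIN = issubclass(multiprocessing.TimeoutError, builtins.TimeoutError)


def slow_identity(x):
    # Hypothetical worker that takes about half a second per item.
    time.sleep(0.5)
    return x


def first_result_or_timeout(timeout):
    """Return the first result, or "timed out" if it does not arrive in time."""
    with ThreadPool(1) as pool:
        results = pool.imap_unordered(slow_identity, ["done"])
        try:
            return results.next(timeout=timeout)
        except multiprocessing.TimeoutError:
            return "timed out"
```

With a timeout shorter than the worker's sleep, first_result_or_timeout takes the except branch. With a generous timeout it returns the worker's result.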