Hi (reposting since I can't find the original in search and can't edit it),
I just wanted help confirming that this is an application error (my issue) and not a Ray issue. To give some context: a Ray actor (no max_concurrency set) is calling some multiprocessing module code. This is happening during load testing, when we are hitting Ray Serve with a much higher volume of requests than it can handle.
Since the actor has no max_concurrency, the calls on it should be serialized (executed one at a time)? So we are not sure why the multiprocessing code is failing. Note that there are other Ray actors (including Ray Serve replicas) and tasks running concurrently in this application.
Is it possible that the multiprocessing Pool and the Ray workers are contending for resources, so that the Pool is somehow starved and exits?
2021-09-23 23:34:08,088 ERROR worker.py:428 -- SystemExit was raised from the worker
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 640, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 525, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 532, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 536, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 486, in ray._raylet.execute_task.function_executor
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/opt/my_classification/my_classification/yolov3_util.py", line 26, in predict
    return self.yolov3_detector.predict(image_path, threshold, display_image)
  ...
  ...
    return do_pooled_work(detect_qr_code_worker, pack_args(), num_pools, timeout)
  File "/opt/my_classification/my_classification/prediction/lib/qr_code.py", line 218, in do_pooled_work
    with Pool(num_pools) as pool:
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 76, in _launch
    os._exit(code)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 425, in sigterm_handler
    sys.exit(1)
SystemExit: 1
*** SIGSEGV received at time=1632440048 on cpu 10 ***
PC: @ 0x7fcc6a63c078 (unknown) pollset_shutdown()
    @ 0x7fcc6bd5d980 (unknown) (unknown)
    @ 0x7fcc6a618300 32 cq_shutdown_next()
    @ 0x7fcc6a618732 176 grpc_completion_queue_shutdown
    @ 0x7fcc69ff0e54 64 ray::gcs::ServiceBasedGcsClient::~ServiceBasedGcsClient()
    @ 0x7fcc69fecbba 32 std::_Sp_counted_base<>::_M_release()
    @ 0x7fcc6a0c7778 80 ray::core::CoreWorker::~CoreWorker()
    @ 0x7fcc69fecbba 32 std::_Sp_counted_base<>::_M_release()
    @ 0x7fcc6a0e944e 144 ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
    @ 0x7fcc69f743d7 32 __pyx_pw_3ray_7_raylet_10CoreWorker_9run_task_loop()
    @ 0x55f45df76791 (unknown) _PyMethodDef_RawFastCallKeywords
    @ 0x7fcc69f743c0 (unknown) (unknown)
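One thing I noticed in the traceback: the frames under popen_fork.py run in a forked child process, yet the last Python frame is Ray's sigterm_handler. As far as I understand, a process created with fork inherits the signal handlers installed in its parent, which might explain why a Pool worker ends up in that handler. Below is a minimal stdlib-only sketch of that inheritance, assuming the "fork" start method (POSIX only). The names parent_sigterm_handler, report_handler and child_sigterm_handler_name are hypothetical stand-ins and not Ray code.

```python
import multiprocessing as mp
import signal
import sys


def parent_sigterm_handler(signum, frame):
    # Hypothetical stand-in for a handler installed by the parent process
    # (this is NOT Ray's actual handler).
    sys.exit(1)


def report_handler(queue):
    # Runs in the forked child: report which SIGTERM handler is active there.
    handler = signal.getsignal(signal.SIGTERM)
    queue.put(getattr(handler, "__name__", repr(handler)))


def child_sigterm_handler_name():
    # Install a handler in the parent, fork a child, and ask the child
    # which SIGTERM handler it sees. Must be called from the main thread.
    previous = signal.signal(signal.SIGTERM, parent_sigterm_handler)
    try:
        ctx = mp.get_context("fork")  # assumption: fork start method
        queue = ctx.Queue()
        proc = ctx.Process(target=report_handler, args=(queue,))
        proc.start()
        name = queue.get(timeout=10)
        proc.join()
        return name
    finally:
        # Restore the handler that was active before this experiment.
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGTERM, restore)


if __name__ == "__main__":
    print(child_sigterm_handler_name())
```

If this matches what happens inside the actor, a SIGTERM delivered to a Pool worker would run the inherited handler instead of the default action. I have not confirmed that this is the cause here.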
Here is the code for do_pooled_work:
def do_pooled_work(func, iterable, num_pools, timeout):
    with Pool(num_pools) as pool:
        results = pool.imap_unordered(func, iterable)
        while True:
            try:
                r = results.next(timeout=timeout)
            except StopIteration:
                break
            except TimeoutError:
                print("TimeoutError")
                break
            except cv2.error as exc:
                logger.error(exc)
                break
            if r:
                return r
    return ""
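For anyone who wants to try this outside of Ray, here is a self-contained sketch of the same loop that runs with only the standard library. A few things in it are assumptions rather than our real code: fake_detect_worker is a hypothetical stand-in for detect_qr_code_worker, the "fork" start method is requested explicitly (the traceback goes through popen_fork.py), the cv2.error branch is dropped to avoid the OpenCV dependency, and the timeout is caught as multiprocessing.TimeoutError on the assumption that this is what our module imports.

```python
import multiprocessing as mp


def fake_detect_worker(n):
    # Hypothetical stand-in for detect_qr_code_worker: returns a non-empty
    # string on a "hit" (even input) and an empty string otherwise.
    return f"hit:{n}" if n % 2 == 0 else ""


def do_pooled_work(func, iterable, num_pools, timeout):
    # Same control flow as the original helper: return the first truthy
    # result, or "" if the work runs out or a result times out.
    ctx = mp.get_context("fork")  # assumption: fork start method (POSIX)
    with ctx.Pool(num_pools) as pool:
        results = pool.imap_unordered(func, iterable)
        while True:
            try:
                r = results.next(timeout=timeout)
            except StopIteration:
                break
            except mp.TimeoutError:
                print("TimeoutError")
                break
            if r:
                # Leaving the with-block here terminates the pool workers.
                return r
    return ""


if __name__ == "__main__":
    print(do_pooled_work(fake_detect_worker, [1, 3, 4], 2, 10))
```

Note that returning from inside the with-block means the pool is terminated while other workers may still be running, which is also what our real helper does.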