SystemExit: Ray error or app error? (repost, as I can't edit the original)

Hi (reposting since I can't find the original post in search and can't edit it),

I'd like help confirming whether this is an application error (my issue) or a Ray issue. Some context: a Ray actor (no max_concurrency set) calls code that uses the multiprocessing module. The failure happens during load testing, when we hit Ray Serve with a far higher volume of requests than it can handle.

Since the actor has no max_concurrency, calls to it should run one at a time, right? So we are not sure why the multiprocessing code is failing. Note that other Ray actors (including Ray Serve replicas) and tasks are running concurrently in this application.

Is it possible that the multiprocessing Pool and the Ray workers are contending for resources, so that the Pool is starved and exits?

2021-09-23 23:34:08,088  ERROR worker.py:428 -- SystemExit was raised from the worker
  Traceback (most recent call last):
    File "python/ray/_raylet.pyx", line 640, in ray._raylet.task_execution_handler
    File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
    File "python/ray/_raylet.pyx", line 525, in ray._raylet.execute_task
    File "python/ray/_raylet.pyx", line 532, in ray._raylet.execute_task
    File "python/ray/_raylet.pyx", line 536, in ray._raylet.execute_task
    File "python/ray/_raylet.pyx", line 486, in ray._raylet.execute_task.function_executor
    File "/opt/conda/lib/python3.7/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
      return method(__ray_actor, *args, **kwargs)
    File "/opt/my_classification/my_classification/yolov3_util.py", line 26, in predict
      return self.yolov3_detector.predict(image_path, threshold, display_image)
      ...
      ...
     return do_pooled_work(detect_qr_code_worker, pack_args(), num_pools, timeout)
    File "/opt/my_classification/my_classification/prediction/lib/qr_code.py", line 218, in do_pooled_work
      with Pool(num_pools) as pool:
    File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 119, in Pool
      context=self.get_context())
    File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
      self._repopulate_pool()
    File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
      w.start()
    File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
      self._popen = self._Popen(self)
    File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
      return Popen(process_obj)
    File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
      self._launch(process_obj)
    File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 76, in _launch
      os._exit(code)
    File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 425, in sigterm_handler
      sys.exit(1)
  SystemExit: 1
  *** SIGSEGV received at time=1632440048 on cpu 10 ***
  PC: @     0x7fcc6a63c078  (unknown)  pollset_shutdown()
      @     0x7fcc6bd5d980  (unknown)  (unknown)
      @     0x7fcc6a618300         32  cq_shutdown_next()
      @     0x7fcc6a618732        176  grpc_completion_queue_shutdown
      @     0x7fcc69ff0e54         64  ray::gcs::ServiceBasedGcsClient::~ServiceBasedGcsClient()
      @     0x7fcc69fecbba         32  std::_Sp_counted_base<>::_M_release()
      @     0x7fcc6a0c7778         80  ray::core::CoreWorker::~CoreWorker()
      @     0x7fcc69fecbba         32  std::_Sp_counted_base<>::_M_release()
      @     0x7fcc6a0e944e        144  ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
      @     0x7fcc69f743d7         32  __pyx_pw_3ray_7_raylet_10CoreWorker_9run_task_loop()
      @     0x55f45df76791  (unknown)  _PyMethodDef_RawFastCallKeywords
      @     0x7fcc69f743c0  (unknown)  (unknown)
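
The last Python frames above show a handler named sigterm_handler calling sys.exit(1), which suggests the SystemExit is how a SIGTERM surfaced in the process. A minimal sketch of that mechanism outside Ray, assuming a POSIX system (the handler below only mimics the one in the traceback, and run_until_sigterm is a made-up helper name):

```python
import os
import signal
import sys
import time


def sigterm_handler(signum, frame):
    # Mimics the handler in the traceback: turn SIGTERM into SystemExit(1).
    sys.exit(1)


def run_until_sigterm():
    """Install the handler, send SIGTERM to this process, return the exit code seen."""
    previous = signal.signal(signal.SIGTERM, sigterm_handler)
    try:
        os.kill(os.getpid(), signal.SIGTERM)
        # The Python-level handler runs between bytecode instructions,
        # so give it a moment in case it has not fired yet.
        time.sleep(1)
        return None
    except SystemExit as exc:
        return exc.code
    finally:
        # Restore whatever was installed before (None means "not set from Python").
        signal.signal(signal.SIGTERM, previous if previous is not None else signal.SIG_DFL)


if __name__ == "__main__":
    print("SystemExit code:", run_until_sigterm())
```

The SystemExit is raised in whatever frame happens to be executing when the signal arrives, which is why it can show up in the middle of unrelated code such as the multiprocessing internals.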

Here is the multiprocessing code:

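import logging
# results.next(timeout=...) raises multiprocessing.TimeoutError, not the builtin TimeoutError.
from multiprocessing import Pool, TimeoutError

import cv2

logger = logging.getLogger(__name__)


def do_pooled_work(func, iterable, num_pools, timeout):
    """Return the first truthy result from the pool, or "" if none arrives."""
    # Leaving the with-block calls pool.terminate(), which stops the workers.
    with Pool(num_pools) as pool:
        results = pool.imap_unordered(func, iterable)
        while True:
            try:
                r = results.next(timeout=timeout)
            except StopIteration:
                # All items were processed without a truthy result.
                break
            except TimeoutError:
                print("TimeoutError")
                break
            except cv2.error as exc:
                logger.error(exc)
                break
            if r:
                return r
    return ""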

Is it possible it's a memory issue? Processes often receive SIGTERM when they exceed k8s memory limits or other kinds of container limits, which would explain the SystemExit.


I think you are right. I thought it was CPU but that doesn’t seem to be the case. I’ll test it out and let you know. Thanks for the info.