Pytest OutOfMemoryError with Ray Actors

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

The issue
My Dockerized Actors code completes successfully in my 10GB container. When I run my pytests (using the same test case within the same container), the e2e test fails with the error ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

Testing process (from within a docker exec shell):

  • python <run my program>: completes :white_check_mark:
  • pytest: fails with OutOfMemoryError :x:

Debugging details
I have two Actors - one handles OCR tasks and the other LayoutAnalysis tasks. My only Ray-enabled function is:

    def layout_analysis_ocr(self, jpgs_buffer):
        agg_text = []

        # Create async actors to execute OCR and layout analysis
        ocr_actors = [Ocr.remote() for i in range(4)]
        lp_actors = [
            LayoutAnalysis.remote(self.layout_analysis_model) for i in range(4)
        ]

        # Gather actors into an ordered pool and distribute tasks to each
        actor_pool = ActorPool(ocr_actors + lp_actors)
        actor_pool_mapping = actor_pool.map(
            lambda a, d: a.process.remote(d), jpgs_buffer[:4] * 2
        )

        # ActorPool.map() yields results in submission order; with this pool the
        # first four results come back from the LayoutAnalysis actors and the
        # last four from the Ocr actors
        results = list(actor_pool_mapping)
        lp_results = results[:4]
        ocr_results = results[4:]

        for lp_result, ocr_result in zip(lp_results, ocr_results):
            page_text = [
                txt for txt in self.ocr_mapping(lp_result[0], ocr_result, lp_result[1])
            ]
            agg_text.extend(page_text)

        # Return list of aggregated text
        return agg_text

The above completes correctly when I execute the script normally. However, when I run it as an e2e pytest, I get the error:

src/extractor.py in layout_analysis_ocr
    results = list(actor_pool_mapping)
/usr/local/lib/python3.9/site-packages/ray/util/actor_pool.py:83: in map
    yield self.get_next()
/usr/local/lib/python3.9/site-packages/ray/util/actor_pool.py:218: in get_next
    return ray.get(future)
/usr/local/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:105: in wrapper
    return func(*args, **kwargs)
-----------------------------------------------------------
E                       ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
E                       Memory on the node <info> where the task (actor ID:# name=LayoutAnalysis.__init__, memory used=1.29GB) was running was 10.21GB / 10.70GB (0.954247), which exceeds the memory usage threshold of 0.95. Ray killed this worker <info> because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip #`. To see the logs of the worker, use `ray logs worker-<info>`. 

Top 10 memory users:
E                       PID	MEM(GB)	COMMAND
E                       #	2.59	        /usr/local/bin/python /usr/local/bin/pytest
E                       #	2.36	        python3 -m src.grpc
E                       #	1.29	        ray::LayoutAnalysis.process
E                       #	1.27	        ray::LayoutAnalysis.process
E                       #	0.32	        ray::LayoutAnalysis
E                       #	0.32	        ray::LayoutAnalysis
E                       #	0.06	        /usr/local/bin/python -u <info>
E                       #	0.06	        /usr/local/bin/python /usr/local/lib/python3.9/site-packages/ray/dashboard/dashboard.py
E                       #	0.06	        ray::Ocr
E                       #	0.06	        ray::Ocr

E                       Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
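
For reference, this is how I understand the knobs that the message points to would be applied; a rough sketch only, with placeholder sizes and retry counts that I have not tuned (requesting memory per actor and raising the kill threshold are independent options):

    import os
    import ray

    # The kill threshold is read when the Ray runtime starts, so it must be set
    # before ray.init() in-process (or exported before `ray start` on a cluster).
    os.environ["RAY_memory_usage_threshold"] = "0.98"    # default is 0.95
    # os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # disables worker killing

    ray.init()

    # Per-actor knobs: reserving memory lets the scheduler account for each
    # actor's footprint, and max_restarts / max_task_retries let the actor and
    # its in-flight task be retried if a worker is OOM-killed.
    # (1 GiB is a placeholder based on the ~1.3GB LayoutAnalysis workers above.)
    @ray.remote(memory=1 * 1024**3, max_restarts=1, max_task_retries=1)
    class LayoutAnalysis:
        def __init__(self, model=None):
            self.model = model

        def process(self, jpg):
            ...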

@Erin-Boehmer What is the e2e pytest doing that is different from the normal script? Perhaps include the script here.

The functions are identical, but one runs as a pytest and the other is called from the command line:

test_extractor.py (pytest):

    @pytest.mark.e2e
    def test_extract_text(self):
        doc_link = "<s3_file.pdf>"
        extraction_model = extract_methods(doc_link)
        fields = extraction_model.extract_text()
        assert fields is not None

call from script:

    def benchmark_text_extraction():
        doc_link = "<s3_file.pdf>"
        extraction_model = extract_methods(doc_link)
        success = extraction_model.extract_text()
        return

    if __name__ == "__main__":
        benchmark_text_extraction()

Since Ray attempts to prevent out-of-memory errors via the memory monitor, and the pytest process itself accounts for 2.59GB in the listing above, it seems this issue results from artifacts pytest adds to the runtime. I have increased the Docker container's memory limit now that I'm more confident the scope of this error is limited to pytest.
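
In case it helps anyone hitting the same thing, the other change I'm considering is starting Ray explicitly from a session-scoped pytest fixture with tighter limits, rather than letting the code under test bring it up with defaults, so that pytest's own footprint plus the Ray workers stay under the container limit. This is a sketch only; the fixture name and limits are assumptions, and it presumes the code under test either doesn't call ray.init() itself or passes ignore_reinit_error=True:

    # conftest.py (sketch)
    import pytest
    import ray

    @pytest.fixture(scope="session", autouse=True)
    def ray_session():
        ray.init(
            num_cpus=4,                       # fix the CPU count instead of auto-detecting
            object_store_memory=1 * 1024**3,  # bound the shared object store
            include_dashboard=False,          # skip the dashboard process listed above
        )
        yield
        ray.shutdown()

Calling ray.shutdown() at the end of the session also ensures leftover actors from a failed test don't keep memory pinned on the node.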