How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
The issue
My Dockerized Ray Actors code completes successfully inside my 10GB container. When I run my pytests (the same test case, within the same container), the e2e test fails with ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Testing process (from within a docker exec shell):
- `python <run my program>`: completes
- `pytest`: fails with OutOfMemoryError
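For context, the e2e test is wired up roughly along these lines. This is a simplified sketch, not my exact code: the fixture name `ray_session`, the `ray.init` arguments, and the test name are all placeholders.

```python
import pytest
import ray


@pytest.fixture(scope="session")
def ray_session():
    # Start a local Ray instance for the whole test session.
    # num_cpus is a guess; include_dashboard=False just trims overhead.
    ray.init(num_cpus=4, include_dashboard=False)
    yield
    ray.shutdown()


def test_e2e_extraction(ray_session):
    # Placeholder: the real test drives layout_analysis_ocr end to end.
    ...
```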
Debugging details
I have two Actors - one handles OCR tasks and the other LayoutAnalysis tasks. My only Ray-enabled function is:
```python
def layout_analysis_ocr(self, jpgs_buffer):
    agg_text = []
    # Create async actors to execute OCR and layout analysis
    ocr_actors = [Ocr.remote() for _ in range(4)]
    lp_actors = [
        LayoutAnalysis.remote(self.layout_analysis_model) for _ in range(4)
    ]
    # Gather actors into an ordered pool and distribute tasks to each
    actor_pool = ActorPool(ocr_actors + lp_actors)
    actor_pool_mapping = actor_pool.map(
        lambda a, d: a.process.remote(d), jpgs_buffer[:4] * 2
    )
    # ActorPool mapping synchronously pulls result futures in reverse order
    results = list(actor_pool_mapping)
    lp_results = results[:4]
    ocr_results = results[4:]
    for lp_result, ocr_result in zip(lp_results, ocr_results):
        page_text = [
            txt for txt in self.ocr_mapping(lp_result[0], ocr_result, lp_result[1])
        ]
        agg_text.extend(page_text)
    # Return the list of aggregated text
    return agg_text
```
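As a side note, one mitigation I am considering is declaring a logical memory reservation per actor via `.options()`, so Ray's scheduler accounts for the heavy LayoutAnalysis actors instead of packing all eight onto the node. A sketch; the byte values are rough guesses based on the memory report further down, not measured limits:

```python
GB = 1024 ** 3

# Reserve logical memory per actor so the scheduler spaces them out.
# Values are rough guesses from the OOM report, not measured limits.
ocr_actors = [Ocr.options(memory=int(0.1 * GB)).remote() for _ in range(4)]
lp_actors = [
    LayoutAnalysis.options(memory=int(1.5 * GB)).remote(self.layout_analysis_model)
    for _ in range(4)
]
```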
layout_analysis_ocr completes correctly when I execute the program normally. However, when I run it through the e2e pytest, I get this error:
```
src/extractor.py in layout_analysis_ocr
    results = list(actor_pool_mapping)
/usr/local/lib/python3.9/site-packages/ray/util/actor_pool.py:83: in map
    yield self.get_next()
/usr/local/lib/python3.9/site-packages/ray/util/actor_pool.py:218: in get_next
    return ray.get(future)
/usr/local/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:105: in wrapper
    return func(*args, **kwargs)
-----------------------------------------------------------
E   ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
E   Memory on the node <info> where the task (actor ID: #, name=LayoutAnalysis.__init__, memory used=1.29GB) was running was 10.21GB / 10.70GB (0.954247), which exceeds the memory usage threshold of 0.95. Ray killed this worker <info> because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip #`. To see the logs of the worker, use `ray logs worker-<info>`.
E   Top 10 memory users:
E   PID  MEM(GB)  COMMAND
E   #    2.59     /usr/local/bin/python /usr/local/bin/pytest
E   #    2.36     python3 -m src.grpc
E   #    1.29     ray::LayoutAnalysis.process
E   #    1.27     ray::LayoutAnalysis.process
E   #    0.32     ray::LayoutAnalysis
E   #    0.32     ray::LayoutAnalysis
E   #    0.06     /usr/local/bin/python -u <info>
E   #    0.06     /usr/local/bin/python /usr/local/lib/python3.9/site-packages/ray/dashboard/dashboard.py
E   #    0.06     ray::Ocr
E   #    0.06     ray::Ocr
E   Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
```
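Following the suggestions at the end of the message, the next things I plan to try are raising the kill threshold and enabling retries after an OOM kill. A sketch; the 0.98 threshold and the retry counts are my own guesses, not values from the docs:

```python
import os

# The memory monitor reads this when Ray starts, so it must be set before
# ray.init (or before `ray start` on the node).
os.environ["RAY_memory_usage_threshold"] = "0.98"

import ray

ray.init()


# Allow the actor to restart and its tasks to retry if a worker is OOM-killed.
@ray.remote(max_restarts=1, max_task_retries=1)
class LayoutAnalysis:
    ...
```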