Object store management for Ray actor tasks whose results do not need to be saved once executed

How severely does this issue affect your experience of using Ray?

I am currently using Ray as an inference engine for our ML service, which receives API calls (FastAPI) to trigger inference on an RTSP input stream. My current Ray setup runs in a Docker container with one GPU, which is used by the ML model actor; all other tasks run on CPUs. The shared memory size for the Docker container is set to 64 GB.

The overall workflow is:

FastAPI controller → Ray actor manager (starts/stops the Ray tasks for the stream-reading processes) → long-running Ray task for stream reads → continuously sends remote calls to the ML model actor for inference on the retrieved frames; the inference function does not return anything (it just saves the inference results in the DB).

My question is: when I make a series of remote inference calls to the ML model actor from a non-terminating Ray task, the object store seems to grow continuously even after the remote inference tasks have finished executing. I tried deleting the object reference for each remote call, but it does not help.

My code for the stream processing Ray Task would look something like the following:

def start_stream(start_request: StartStreamRequest):
    while True:
        detection_request = StartDetectionRequest(...)  # request fields elided here
        # Fire-and-forget call; the return value is never fetched with ray.get()
        obj_ref = ml_model.inference.remote(detection_request)
        del obj_ref  # drop the local reference to the result

I would like to know if there is any way to manage tasks that should be executed in a "detached" (fire-and-forget) mode, and how GC works for finished tasks that are owned by long-running tasks.

From running the `ray memory` command, I see that the object refs for the finished tasks remain, like the following:

1 | Driver | disabled | FINISHED | 248201617.0 B | LOCAL_REFERENCE | 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505
1 | Driver | disabled | FINISHED | 253791831.0 B | LOCAL_REFERENCE | 00ffffffffffffffffffffffffffffffffffffff0100000005e1f505
1 | Driver | disabled | FINISHED | 351477931.0 B | LOCAL_REFERENCE | 00ffffffffffffffffffffffffffffffffffffff0100000003e1f505

Also, I would like to know how concurrency works for actor tasks. I believe the default concurrency is 1000 for an actor, but in my case, after one full round of 1000 concurrently executing tasks, it eventually slows down and only executes 100–300 tasks concurrently on the actor. I am guessing this might be due to the memory issue described above, but please let me know if there is anything else I should consider or test. Thank you!!!
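(For context, my mental model of this concurrency cap is the one below — a minimal stand-in sketch in plain asyncio, not Ray: a semaphore bounds how many coroutines run at once, which is conceptually what an async actor's concurrency limit does. All names and numbers here are illustrative.)

```python
import asyncio

async def main() -> int:
    sem = asyncio.Semaphore(3)  # stand-in for an actor's concurrency cap
    running = 0
    peak = 0

    async def task():
        nonlocal running, peak
        async with sem:          # at most 3 tasks hold the semaphore at once
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for the inference work
            running -= 1

    await asyncio.gather(*(task() for _ in range(10)))
    return peak

peak = asyncio.run(main())
assert 1 <= peak <= 3  # concurrency never exceeded the cap
```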

Hi @claire.lim

What Ray version are you using? Also, how do you know the object store is continuously increasing?

Also, can you try gc.collect() after del obj_ref to manually trigger garbage collection?
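A minimal illustration of why an explicit gc.collect() can matter in plain CPython (no Ray involved): objects caught in a reference cycle are not freed when their names are deleted, because their reference counts never reach zero; only the cyclic collector reclaims them. References held that way can keep downstream resources pinned until a collection runs.

```python
import gc
import weakref

class Node:
    """Two Nodes pointing at each other form a reference cycle."""
    def __init__(self):
        self.other = None

gc.disable()               # make the demonstration deterministic

a, b = Node(), Node()
a.other, b.other = b, a    # create the cycle
probe = weakref.ref(a)     # lets us observe whether `a` is still alive

del a, b                   # names gone, but the cycle keeps refcounts > 0
assert probe() is not None # objects are still alive

gc.collect()               # the cyclic collector breaks the cycle
assert probe() is None     # now they are reclaimed

gc.enable()
```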


Thank you for the reply.

I am using 2.9.1, and I can see the object store increasing continuously (and, of course, the spilling) in the Ray dashboard.

I will give gc.collect() a try as you suggested and update the post. Thank you!

I tried gc.collect(), and it does seem to help greatly with keeping the object store memory at a reasonable size (no spilling like before).

Thank you for your help!


I've confirmed that gc.collect() does help with removing unwanted objects from the object store. One thing I am not sure about, though, is why I need to trigger garbage collection manually. Triggering gc manually does affect performance, and I would like to know whether there is anything else I should look into, or a better way to manage the object store, other than calling gc.collect().
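One commonly recommended alternative to periodic manual GC is backpressure: cap the number of in-flight calls so unconsumed results never pile up in the first place. With Ray, the usual shape is to keep a bounded set of ObjectRefs and block on ray.wait() before submitting more. The sketch below shows the same pattern using concurrent.futures so it runs anywhere without a Ray cluster; the inference function, the cap value, and the frame loop are all illustrative stand-ins, not the original code.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

MAX_IN_FLIGHT = 4  # cap on unfinished calls (tune to your workload)
results = []

def inference(frame):
    # Stand-in for ml_model.inference.remote(...)
    return frame * 2

with ThreadPoolExecutor(max_workers=2) as pool:
    in_flight = set()
    for frame in range(20):  # stand-in for the RTSP frame loop
        if len(in_flight) >= MAX_IN_FLIGHT:
            # Block until at least one call finishes before submitting more;
            # with Ray this would be: done, in_flight = ray.wait(list(in_flight))
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
        in_flight.add(pool.submit(inference, frame))
    # Drain whatever is still running
    done, _ = wait(in_flight)
    results.extend(f.result() for f in done)

assert sorted(results) == [i * 2 for i in range(20)]
```

Because finished handles are dropped as soon as their work completes, at most MAX_IN_FLIGHT results ever exist at once, so memory stays bounded without any explicit gc.collect() calls.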