Pipeline with no ray.get and a memory leak

I have a pipeline where a camera actor continuously acquires images, pre-processes each image, sends it to a GPU actor, and then passes the result on to a post-processing actor (which prints the result to the console). At no point in the process do I invoke ray.get, because I don’t want to introduce a blocking call at any step. I am observing that the memory/RAM consumed by the PostProcessActor slowly increases over time. The plasma store memory stays roughly constant (I am monitoring it via the dashboard), but the RAM keeps growing. I am not sure where exactly the leak is, so any pointers on how to debug this would be great. Could there be a problem if new tasks are constantly created but their results are never fetched via ray.get? My pipeline roughly looks like this:

import ray
import numpy as np

@ray.remote
class CameraActor:
    def __init__(self, gpu_actor, post_actor):
        self.gpu_actor = gpu_actor
        self.post_actor = post_actor

    def acquire(self):
        while True:
            # Simulate grabbing a raw frame from the camera.
            cam_img = np.random.randint(0, 255, (3000, 5000)).astype(np.uint8)
            pre_img = self.preprocess(cam_img)  # of size 3 x 500 x 700
            infer_ref = self.gpu_actor.infer.remote(pre_img)
            # Pass the inference ref straight through; no ray.get anywhere.
            self.post_actor.process.remote(cam_img, infer_ref)

    def preprocess(self, cam_img):
        # Placeholder for the real pre-processing (resize, normalize, ...).
        return np.zeros((3, 500, 700), dtype=np.float32)

@ray.remote
class GPUActor:
    def __init__(self):
        pass

    def infer(self, pre_img):
        # Do some work on the GPU.
        return np.zeros((100, 100))

@ray.remote
class PostProcessActor:
    def __init__(self):
        pass

    def process(self, cam_img, infer_results):
        # Do some post-processing and arrive at a result.
        result = ""
        print(result)

Maybe you could try printing out your heap over time upon each invocation of actor.process.remote()?

You would want to run the heap inspection within that class method.
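Something along these lines, as a minimal sketch using tracemalloc from the standard library (any heap profiler, e.g. guppy3 or objgraph, would work just as well; the sampling interval of 100 calls is arbitrary):

import tracemalloc

import ray

@ray.remote
class PostProcessActor:
    def __init__(self):
        tracemalloc.start()   # start tracking Python allocations in this worker
        self.calls = 0

    def process(self, cam_img, infer_results):
        # ... existing post-processing ...
        self.calls += 1
        if self.calls % 100 == 0:  # sample the heap every 100 invocations
            snapshot = tracemalloc.take_snapshot()
            for stat in snapshot.statistics("lineno")[:10]:
                print(stat)        # top allocation sites by size

If the top entries stay flat while the process RSS keeps growing, the growth is probably not in Python-level objects held by the actor itself.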


Usually the best way to debug this kind of issue is the ray memory command (see Memory Management — Ray v1.1.0), or the memory tab in the dashboard. With it you can track whether there are objects that were supposed to be deleted but weren’t.
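If you want to capture that over time rather than running it by hand, a rough sketch is to shell out to the CLI from a small monitoring script (this assumes the ray CLI is on the PATH of the node running the script; the exact output format varies a bit across Ray versions):

import subprocess
import time

# Dump the `ray memory` summary every 30 seconds so you can diff the
# list of pinned object refs over time.
while True:
    result = subprocess.run(["ray", "memory"], capture_output=True, text=True)
    print(result.stdout)
    time.sleep(30)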

+1 to looking at some more detailed memory information (in particular ray memory may provide some insight here).

Btw, unless you run into OOM issues, growing memory usage isn’t necessarily a problem. Object store eviction doesn’t happen until the object store actually fills up. Also, if you’re using something like htop to measure your RAM usage, the object store’s shared-memory optimization will cause the object store to be double counted as RAM usage in your raylet and in every worker process on the node.
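If you need per-process numbers that aren’t distorted by that double counting, something like psutil’s memory_full_info() lets you separate shared pages from RSS. A rough sketch, assuming psutil is installed and you’re on Linux (the field names are platform-specific):

import psutil

# Print RSS vs. unique (USS) and shared memory for every Ray-related process.
# USS is the memory that would actually be freed if the process exited, so it
# excludes the shared plasma pages that inflate RSS.
for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    try:
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ray" not in cmdline and "raylet" not in (proc.info["name"] or ""):
            continue
        mem = proc.memory_full_info()
        print(f"{proc.pid:>7} {proc.info['name']:<20} "
              f"rss={mem.rss >> 20} MiB uss={mem.uss >> 20} MiB "
              f"shared={mem.shared >> 20} MiB")
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue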

@rliaw Sure, I will do this and see what’s going on.

@Alex I am not running into any OOM issues, and I am inspecting the shared memory size through the dashboard (Machine View). As I mentioned in my post, the plasma store memory is roughly constant over time, and I also track the number of objects through len(ray.state.objects()); nothing crazy seems to be going on there.

Oh, I totally did not know about memory being double counted in htop. In htop I see 25 GB out of 32 GB consumed, but I allocated only object_store_memory=10GB when I launched the app. Is there a way I can accurately measure the total consumed memory? We have a requirement to precisely monitor memory utilization of the device over time.

Have you figured this out yet?
I have exactly the same issue: I’ve got a pipeline set up without a single ray.get, and my detector’s RAM usage still keeps slowly rising.