Pipeline with no ray.get and a memory leak

I have a pipeline where a camera actor continuously acquires images, pre-processes each one, sends it to a GPU actor, and then on to a post-processing actor (which prints the result to the console). At no point in the pipeline do I invoke ray.get, because I don’t want to introduce a blocking call at any step. I am observing that the memory/RAM consumed by the PostProcessActor slowly increases over time. The plasma store memory stays roughly constant (I am monitoring it via the dashboard), but the RAM keeps growing. I am not sure where exactly the leak is; it would be great to get some pointers on how to debug this. Could there be a problem if new tasks are constantly created but their results are never fetched via ray.get? My pipeline roughly looks like this:

import ray
import numpy as np

@ray.remote
class CameraActor:
    def __init__(self, gpu_actor, post_actor):
        self.gpu_actor = gpu_actor
        self.post_actor = post_actor

    def acquire(self):
        while True:
            cam_img = np.random.randint(0, 255, (3000, 5000)).astype(np.uint8)
            pre_img = self.preprocess(cam_img)  # of size 3 x 500 x 700
            infer_ref = self.gpu_actor.infer.remote(pre_img)
            self.post_actor.process.remote(cam_img, infer_ref)

    def preprocess(self, cam_img):
        # resize/normalize the raw frame (placeholder)
        return np.zeros((3, 500, 700), dtype=np.float32)

@ray.remote
class GPUActor:
    def __init__(self):
        pass

    def infer(self, pre_img):
        # do some work on the GPU
        return np.zeros((100, 100))

@ray.remote
class PostProcessActor:
    def __init__(self):
        pass

    def process(self, cam_img, infer_results):
        # do some post processing and arrive at a result
        result = ""
        print(result)

Maybe you could try printing out your heap over time upon each invocation of actor.process.remote()?

You would want to run the heap inspection within that class method.
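A minimal sketch of that heap inspection using the stdlib tracemalloc module (the `process` function and the simulated leak here are stand-ins for your actor method, not your actual code):

```python
import tracemalloc

tracemalloc.start()

_leak = []  # simulated leak: grows on every call

def process(frame):
    # stand-in for PostProcessActor.process
    _leak.append(bytes(10_000))
    snapshot = tracemalloc.take_snapshot()
    # print the 3 call sites allocating the most memory
    for stat in snapshot.statistics("lineno")[:3]:
        print(stat)

for _ in range(5):
    process(b"frame")
```

If one line's allocation total keeps growing across invocations, that is your leak candidate.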


Usually the best way to debug this kind of issue is the ray memory command (Memory Management — Ray v1.1.0), or the memory tab in the dashboard. With it you can track whether there are objects that should have been deleted but weren’t.

+1 to looking at some more detailed memory information (in particular ray memory may provide some insight here).

Btw, unless you run into OOM issues, growing memory isn’t necessarily a problem. Object store eviction doesn’t happen until the object store actually fills up. Also, if you’re using something like htop to measure your RAM usage, the object store’s shared-memory optimization will cause the object store to be double counted as RAM usage in your raylet and in every worker process on the node.

@rliaw Sure, I will do this and see what’s going on.

@Alex I am not running into any OOM issues, and I am inspecting the shared memory size through the dashboard (Machine View). As I mentioned in my post, the plasma store memory is roughly constant over time, and I also track the number of objects via len(ray.state.objects()) — nothing crazy seems to be going on there.

Oh, I totally did not know about memory being double counted in htop. In htop I see 25 GB out of 32 GB consumed, but I allocated only object_store_memory=10GB when I launched the app. Is there a way I can accurately measure the total consumed memory? We have a requirement to precisely monitor the device’s memory utilization over time.
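(One option, an assumption on my part rather than Ray-specific advice: per-process USS from psutil counts only pages unique to each process, so plasma’s shared memory isn’t double counted the way RSS is in htop. A sketch, assuming psutil is installed and a Linux host:)

```python
import psutil

proc = psutil.Process()
info = proc.memory_full_info()
# rss counts shared (plasma) pages in every process that maps them;
# uss counts only pages unique to this process
print(f"rss: {info.rss / 1e6:.1f} MB")
print(f"uss: {info.uss / 1e6:.1f} MB")
```

Summing uss across all Ray worker processes, plus the object store size once, gives a less inflated total than summing rss.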

Have you figured this out yet?
I have exactly the same issue: a pipeline set up without a single ray.get, yet my detector’s RAM usage slowly keeps rising.