Memory (RAM) not being released by Ray

  • High: It blocks me from completing my task.

In my project pipeline, I have two actors, CamActor (producer) and StreamActor (consumer). CamActor reads a camera stream and puts frames (Python class instances holding an RGB frame and meta information) into frameHolder (a Queue). StreamActor reads frames asynchronously from frameHolder and generates an encoded stream.
The camera streams at 20 fps and StreamActor can process 45 fps.

For every new job, the pipeline creates a new CamActor and frameHolder (queue); StreamActor is shared across all jobs.
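
Roughly, the pipeline looks like this (a simplified sketch, not the actual project code; method names such as _read_frame/_encode and the concurrency settings are placeholders):

```python
import ray
from ray.util.queue import Queue, Empty

@ray.remote(max_concurrency=2)  # threaded actor so stop_job() can run while start_job() loops
class CamActor:
    def __init__(self, frame_holder: Queue):
        self.frame_holder = frame_holder
        self.alive = True

    def start_job(self):
        while self.alive:                    # camera delivers ~20 fps
            frame = self._read_frame()       # placeholder: returns a frame object (RGB + meta)
            self.frame_holder.put(frame)

    def stop_job(self):
        self.alive = False

    def _read_frame(self):
        ...                                  # read one frame from the camera stream


@ray.remote(max_concurrency=8)  # single shared consumer serving all jobs
class StreamActor:
    def __init__(self):
        self.alive = {}                      # job_id -> running flag

    def start_job(self, job_id, frame_holder: Queue):
        self.alive[job_id] = True
        while self.alive[job_id]:            # total throughput tops out around 45 fps
            try:
                frame = frame_holder.get(timeout=1)
            except Empty:
                continue
            self._encode(frame)              # placeholder: append to the encoded stream

    def stop_job(self, job_id):
        self.alive[job_id] = False

    def _encode(self, frame):
        ...
```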

When I start two parallel jobs, it works perfectly, as StreamActor can handle 40 fps (two camera frameHolders are being fed at 20 fps each). When I start a third job, frames start to pile up in the queues, as StreamActor can only handle 45 fps while the input is 60 fps (three camera frameHolders fed at 20 fps each). So every second 15 frames are left behind in the queues, which causes memory to grow.

Now, the main issue is as follows:
If I start two jobs, some memory is occupied, and when I stop them after some time, the memory is released.
But when I start three jobs, some memory is occupied at the start and memory consumption increases gradually (due to the extra frames in the frameHolders). Now, if I stop any/all of the jobs, Ray does not release the memory occupied by these extra frames.

Things tested:

  • Tried to clear the queue (read all remaining data and delete it) before shutdown.
  • Added gc.collect().

Note: I can’t use ray.shutdown() at the end of a job here, because parallel jobs sharing the same StreamActor are still running.

Hi @shyampatel,

How is your Queue implemented, is it an actor? How do you stop your jobs? Could you try ray memory and see what it tells you?

Thanks for the quick response, @jjyao.

I am using Ray’s built-in queue: from ray.util.queue import Queue.

My stop logic is as follows:

  • Each job is assigned a unique job_id.
  • All the actors are detached.
  • Each actor has an alive flag, based on which it exits its continuous while loop. Each actor has start_job and stop_job functionality. stop_job disables the flag, removes the remaining data from the queue and shuts the queue down.
  • The CamActor is named based on the job_id.
  • StreamActor is defined with a fixed name. The same actor is used for every job; we call its asynchronous start_job method and pass frame_holder as an argument.
  • On a stop call, we pass the job_id we want to stop. It gets both actors and calls stop_job (a rough sketch follows below).
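
In code, the stop path looks roughly like this (a simplified sketch; the actor name patterns and the exact ordering are assumptions, not the real implementation):

```python
import ray
from ray.util.queue import Empty

def stop_pipeline_job(job_id, frame_holder):
    # Look up the detached, named actors for this job.
    cam_actor = ray.get_actor(f"cam_{job_id}")      # name pattern is an assumption
    stream_actor = ray.get_actor("stream_actor")    # fixed name, shared by all jobs

    # Disable the alive flags so both while-loops exit.
    ray.get(cam_actor.stop_job.remote())
    ray.get(stream_actor.stop_job.remote(job_id))

    # Drain any leftover frames, then shut the queue actor down.
    # (In practice you may want to wait for StreamActor.start_job to return
    # before shutting the queue down.)
    while True:
        try:
            frame_holder.get_nowait()
        except Empty:
            break
    frame_holder.shutdown()
```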

During the jobs, ray memory returned the following:

192.168.10.101    16177  Worker  disabled                -               519485.0 B  PINNED_IN_MEMORY    9c40ab3d946b66c56dd9193d64d6aeab067f04250300000001000000
192.168.10.101    17942  Worker  disabled                -               519686.0 B  PINNED_IN_MEMORY    17c06bbeb8257d3a1bcff6cb31db22b0ff54384d0400000001000000
192.168.10.101    16187  Worker  disabled                -               519689.0 B  PINNED_IN_MEMORY    b5211e7a44521434a5d83baccfaa3db42d9c4fa60200000001000000
192.168.10.101    17810  Worker  disabled                -               520267.0 B  PINNED_IN_MEMORY    3bf4fc1c506592416740a95795520bd9356963ec0400000001000000
192.168.10.101    17026  Worker  disabled                -               520270.0 B  PINNED_IN_MEMORY    e0ec20487db2284444c9fb3df8db6f7c15395f200300000001000000
                                        .................... and a few more lines like the above
--- Aggregate object store stats across all nodes ---
Plasma memory usage 1833 MiB, 3704 objects, 40.26% full, 0.04% needed
Objects consumed by Ray tasks: 40829 MiB.

After ending all jobs, ray memory returned the following:

======== Object references status: 2022-08-17 09:45:42.952123 ========
Grouping by node address...        Sorting by object size...        Display all entries per group...


To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 0 MiB, 6 objects, 0.0% full, 0.0% needed
Spilled 15 MiB, 31 objects, avg write throughput 11 MiB/s
Objects consumed by Ray tasks: 56454 MiB.

I am using custom resources for each actor. After stopping any job, the actors are killed as expected and release their CPU and custom resources. In the dashboard too, after stopping a job, the actors for that job show as DEAD.
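
For reference, the per-job actor is created along these lines (a sketch; the custom resource name, resource amount, and actor name pattern are placeholders, and the custom resource itself has to be declared on the node, e.g. in cluster.yaml):

```python
# Hypothetical creation of the per-job producer with a custom resource,
# a detached lifetime, and a job_id-based name.
cam_actor = CamActor.options(
    name=f"cam_{job_id}",          # name pattern is an assumption
    lifetime="detached",
    num_cpus=1,
    resources={"camera_slot": 1},  # custom resource name is an assumption
).remote(frame_holder)
```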

When you stop all jobs, which process is holding the memory that you think should be released? Is it object store memory or heap memory? Could you check Queue.empty() to make sure it’s actually empty?

Also, for the detached actors, do you manually kill them, since they won’t be automatically GCed (Terminating Actors — Ray 1.13.0)?

When I stop a job, each actor is killed successfully. After stopping all jobs, when I check the dashboard, object store memory and heap memory come back to their starting point.
Before job start: [dashboard screenshot]
After stopping all jobs: [dashboard screenshot]

But when I check RAM usage using htop, I find that some memory is still occupied.
Before job start: [htop screenshot]
After stopping all jobs: [htop screenshot]

Yes, using Queue.empty() I have checked at the end that the queue is empty.

For detached actors, I am manually killing them in stop_job.
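
Roughly like this inside stop_job (a sketch; the actor name pattern is an assumption):

```python
import ray

# Detached actors are not garbage-collected automatically,
# so the per-job CamActor is removed explicitly once its loop has stopped.
cam_actor = ray.get_actor(f"cam_{job_id}")  # hypothetical name pattern
ray.kill(cam_actor)
```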

It seems that after all jobs are stopped, the Ray dashboard and htop have different views of how much memory is still occupied. Could you check what another command, like free, says?

Before job start:
ray dashboard: [screenshot]
htop: [screenshot]
free: [screenshot]

After stopping all jobs:
ray dashboard: [screenshot]
htop: [screenshot]
free: [screenshot]

Note:
We are running the cluster using a cluster.yaml file. When we bring the cluster down with the ray down cluster.yaml command, this extra occupied memory is released.

Hi @shyampatel,

Thanks for sharing those screenshots. It seems there is around 1.8 GB that’s not freed. As a next step, could you share the memory usage of the Ray processes (e.g. raylet, the dashboard process) before and after? I’m trying to figure out which process is not releasing the memory. Also, could you run df -BK | grep tmpfs before and after?

My current thinking is that when you stop all jobs, the Ray system processes are still running. Specifically, raylet is running and it still holds the object store memory (even though there are no objects, since we preallocate it).

Thanks for your continued support and suggestions, @jjyao.

I have attached all the required information to debug the issue:

Before job start:
Ray Dashboard: [screenshot]
htop with Ray processes: [screenshot]
free & df -BK | grep tmpfs: [screenshot]
ray memory: [screenshot]

After job start:
Ray Dashboard: [screenshot]
htop with Ray processes: [screenshot]
free & df -BK | grep tmpfs: [screenshot]
ray memory: [screenshot]

Thanks.

Could you check /tmp/ray/session_latest/logs/raylet.out? In the first few lines, you should see something like [2022-08-21 22:35:16,448 I 52466 1147664] (raylet) store_runner.cc:48: Starting object store with directory /tmp, fallback /tmp/ray, and huge page support disabled. Basically I want to see the directory of the object store. Can you show the before and after disk usage of that directory?

Here are the first few lines of /tmp/ray/session_latest/logs/raylet.out:

[2022-08-23 09:55:16,261 I 238744 238744] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-08-23 09:55:16,262 I 238744 238744] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 4.60466GB of memory.
[2022-08-23 09:55:16,262 I 238744 238744] (raylet) store_runner.cc:48: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
[2022-08-23 09:55:16,262 I 238744 238779] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(4604690440, /dev/shm/plasmaXXXXXX)
[2022-08-23 09:55:16,262 I 238744 238779] (raylet) store.cc:546: ========== Plasma store: =================
Current usage: 0 / 4.60466 GB

Here is the requested debug information:
Before job start: [screenshot]
After job start: [screenshot]

Note:

  • I analyzed the disk usage of the object store directory (/dev/shm) during the job, and it was 0 (zero) for the entire job.

@jjyao
Can you please analyze the above information and give some insight into the object store memory? How can I be sure that it has been cleaned up at the end of a job? How can I clean it up manually?

Sorry for the late reply. It turns out that du -hd 0 /dev/shm doesn’t show the correct usage of object store memory. Could you run df -h | grep tmpfs before and after?

Before job start: [df -h | grep tmpfs screenshot]
After job start: [df -h | grep tmpfs screenshot]

Note:

  • Here the /dev/shm storage usage is increasing.

Yeah, that makes sense. The object store preallocates/reserves a fixed amount of memory for the lifetime of the cluster, and it won’t be freed, even when there are no objects, until the cluster is shut down. By default, we don’t pre-populate the object store memory, which means the physical memory is only allocated by the OS as objects are put into the object store, but once it has been allocated, it stays allocated.

In summary, I think the memory that is not being released is occupied by the object store, and this is expected, since the object store keeps that memory around for future objects.

Thanks for the insights, @jjyao.

Is there any way we can reset the object store memory without shutting Ray down?

Actually, there are multiple pipelines running on the system (some with Ray and some without). So we can’t allow a single pipeline to occupy extra space when no job is running.

Hi @shyampatel,

Currently there is no way to reset the object store memory without a Ray shutdown, although you can specify how much memory the object store should reserve via the --object-store-memory option.
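
For example (the value below is only an illustration; when the cluster is launched from cluster.yaml, the flag goes into the ray start commands there rather than into ray.init):

```python
import ray

# Reserve a smaller object store on this node (value in bytes; illustrative only).
# The CLI equivalent is: ray start --object-store-memory=1000000000
ray.init(object_store_memory=1_000_000_000)
```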