Ray won't release memory, not even after ray.shutdown()

I am running a Docker container with Flask. In this Flask app I start a thread (!), and in this thread I call ray.init().
When the task is done, nothing references the objects that were created along the way anymore.
Ray still holds a lot of RAM (150 GB+).
When I start a new task this just gets worse, until the machine dies.
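
Roughly, the setup looks like this (a simplified sketch with placeholder names, not my actual code):

```python
# Simplified sketch of the container setup (placeholder names, not my real code):
# a Flask endpoint spawns a worker thread, and ray.init() is called inside that thread.
import threading

import ray
from flask import Flask

app = Flask(__name__)

def run_task(parquet_path):
    ray.init()  # Ray is started inside the worker thread, not the main thread
    # ... heavy Modin/Ray processing of the parquet file ...
    # when this returns, nothing holds references to the created objects anymore

@app.route("/task", methods=["POST"])
def start_task():
    threading.Thread(target=run_task, args=("data.parquet",)).start()
    return "started"
```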

In other containers the tasks run in processes, not threads; in that case I can kill the process, which also kills the Ray cluster.
But in the thread I can't seem to manage that. ray.shutdown() also doesn't… do much?
It kills… a few processes, but not all. And running a new task leads to errors which kill the container.
When the container restarts, Ray is finally shut down and the memory is released.

But that's not a solution; I can't restart the whole container every time.

How can I make Ray reset? Or actually shut down? Or free its memory?

I am at a loss… again…

@Liquidmasl can you share a repro script for this? Plus a screenshot of the Ray Dashboard that shows these memory holdouts? Are there IDLE workers when you look at the Dashboard?

You are quicker than I can refine my post!

I will provide more tomorrow if I can. I am 14 hours deep into Ray today, and about 100 hours into Ray and Modin over the past 2 weeks.
There is a chance quite a bit above zero that I am just overlooking something or mixing things up: multiprocessing here, multithreading there, attempting this and that, and in between waiting unreasonable amounts of time because I work with large data and did not take the time to make myself a repro, etc.

For now, it seems ray.shutdown() does indeed free most of my memory if I call it at the right time, and also kills the dashboard, as I was expecting. But when I call ray.init() again in the same process, strange issues arise.

Before ray.shutdown(), basically all workers appear as idle. And even when all references to objects are dropped, nothing is garbage collected. But maybe Modin is incorrectly keeping some references that I have no control over.

So right now, when the task is done, I call ray.shutdown() and kill the thread.
When a new task comes in (probably immediately), a new thread is opened where ray.init() is called.
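
Roughly, the per-task lifecycle is this (again a simplified sketch, file names are placeholders):

```python
# Simplified sketch of the per-task lifecycle (placeholder names):
# each task runs in its own thread, and Ray is shut down and re-initialized per task.
import threading

import ray
import modin.pandas as pd

def run_task(parquet_path):
    ray.init()                          # new local Ray instance for this task
    df = pd.read_parquet(parquet_path)  # this is where the error below comes from
    # ... processing ...
    ray.shutdown()                      # called when the task is done

# task 1
t1 = threading.Thread(target=run_task, args=("first.parquet",))
t1.start(); t1.join()

# task 2 comes in immediately afterwards, in a fresh thread but the same process
t2 = threading.Thread(target=run_task, args=("second.parquet",))
t2.start(); t2.join()                   # -> "owner is unknown" exception
```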

But this leads to:
2024-08-29T01:14:48.547596851Z 01:14:48 | 243 | ..intern_depend.flasker.src.task_flasker | [CRITICAL] | segmentor: An uncaught exception was raised: - An application is trying to access a Ray object whose owner is unknown(00ffffffffffffffffffffffffffffffffffffff0100000005e1f505). Please make sure that all Ray objects you are trying to access are part of the current Ray session. Note that object IDs generated randomly (ObjectID.from_random()) or out-of-band (ObjectID.from_binary(...)) cannot be passed as a task argument because Ray does not know which task created them. If this was not how your object ID was generated, please file an issue at https://github.com/ray-project/ray/issues/

Which I don't understand…
The object is definitely a new one, as it is loaded from a .parquet file, and a different .parquet than in the previous task.
Maybe because of some race condition it creates the object in the previous Ray session? Then destroys it and… well, I don't know.
Actually, maybe it's some function (read_parquet in this case)?

Anyway, it sure does not like it when I shut down and re-init in the same process, it seems.
Sadly I can't do multiprocessing because CUDA doesn't like it. So what can I do?
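
The only workaround I can think of would be spawning a fresh interpreter instead of forking, since (as far as I understand) CUDA mainly dislikes fork()ed children; something like this, which I haven't verified against my setup:

```python
# Hedged sketch: run each task in a spawned (not forked) process, so killing the
# process reclaims all of Ray's memory; CUDA usually tolerates spawn, unlike fork.
import multiprocessing as mp

def run_task(parquet_path):
    import ray
    ray.init()
    # ... CUDA / Modin / Ray work ...
    ray.shutdown()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # fresh interpreter instead of fork
    p = ctx.Process(target=run_task, args=("data.parquet",))
    p.start()
    p.join()                        # when the process exits, the OS reclaims everything
```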

I wish I could just stick with the same session without shutting it down at all, and just free the memory somehow. Nothing from the previous task is needed anymore.
But I haven't found a way to do that.
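
What I'm hoping for is something along these lines (a sketch of the pattern I'd like, relying only on dropping references and Python's garbage collector, not on any special Ray API):

```python
# Sketch of what I'd like to do between tasks instead of shutting Ray down:
# drop every reference that might still pin ObjectRefs and let Ray's distributed
# reference counting reclaim the object store. (Assumption: no special API needed.)
import gc

def finish_task(task_state: dict):
    task_state.clear()   # dataframes, intermediate results, anything holding ObjectRefs
    gc.collect()         # force collection of lingering Python references
    # ideally the plasma store and idle workers would release memory here,
    # without ray.shutdown() -- but in practice that does not seem to happen
```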

Anyway, I need sleep. I will provide more tomorrow.
Thank you

PS: Can it be that Modin somehow… registers functions with ray.remote, and when the Ray instance changes it does not re-register them?
It kinda makes sense to me now. Although it sucks…
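
If that theory is right, the failure mode would look roughly like this minimal case (my guess at what Modin might effectively be doing, not its actual code):

```python
# My guess at the failure mode (not Modin's actual code): something created in
# one Ray session is cached and then used after shutdown/re-init.
import ray

ray.init()
cached_ref = ray.put({"some": "internal state"})  # ObjectRef belongs to session 1
ray.shutdown()

ray.init()              # session 2
ray.get(cached_ref)     # fails: the ref is not part of the current Ray session
```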

2024-08-29 03:40:50,412	WARNING worker.py:1423 -- SIGTERM handler is not set because current thread is not the main thread.
2024-08-29 03:40:52,253	INFO worker.py:1753 -- Started a local Ray instance.
ray.shutdown()
2024-08-29 03:41:00,172	WARNING worker.py:1423 -- SIGTERM handler is not set because current thread is not the main thread.
2024-08-29 03:41:02,023	INFO worker.py:1753 -- Started a local Ray instance.

Oh… I just noticed that those log entries are not from the ray.init() call but from the shutdown call… so I guess the shutdown already does not work…?
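
One thing I can at least check next time is whether the shutdown actually took effect in this process:

```python
# Quick sanity check after the task finishes: did ray.shutdown() take effect?
import ray

ray.shutdown()
print("ray still initialized:", ray.is_initialized())  # should print False
```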

I need sleep…

And after some time this comes:

[2024-08-29 03:42:04,080 E 1438786 1869706] gcs_rpc_client.h:564: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out.
https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
The program will terminate.

So yeah, after some sleep it seems that Modin is the culprit.

See this issue: BUG: when ray is shutdown and re initialized, modin methods dont work anymore, throwing exception · Issue #7378 · modin-project/modin (github.com)

I think Modin keeping references to some internally used objects also leads to the memory not being released.

This is very annoying because I am unsure if I can remedy that. Currently trying to debug through Modin, looking for a way to ‘clear the cache’ or something…

Or maybe not.
As I am stumbling through the code, it sure does look like Modin is putting the function into the Ray cluster and getting the object reference back from it.
I fail to see it using some sort of cache.

Maybe it is a Ray issue after all?
I don't know. It's super hard to step through the code and understand what's going on.

All the debugging led me to worker.py.

In put() we get the global_worker.
This is the only place where I could assume something goes wrong. Maybe the global worker is wrong, or the current session that is attached to it… or something.
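
One thing I might try next is comparing what the driver thinks its session is before and after the shutdown/re-init cycle, using only public API (just a debugging idea, not verified):

```python
# Debugging idea: print the job ID of each session; if Modin's cached state still
# points at the first session, the IDs it uses internally would not match.
import ray

ray.init()
print("job 1:", ray.get_runtime_context().get_job_id())
ray.shutdown()

ray.init()
print("job 2:", ray.get_runtime_context().get_job_id())
ray.shutdown()
```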