I did an interesting experiment. Start two servers with GPUs attached, and set one of them up as the head node and the other as a worker node.
Now on the worker node run the following code:
import ray
import torch

ray.init("auto")
a = torch.randn(10, 1024, 1024).cuda()  # 10 * 1024 * 1024 float32s = ~40 MB
z = ray.put(a)
print(ray.cloudpickle.dumps(z))
Now paste that output into a Python session on the head node:
a = ray.get(ray.cloudpickle.loads(OUTPUT_FROM_ABOVE))
a is now, auto-magically, a torch CUDA tensor on the head node. Bravo, ray.get!
But now the ~40 MB memory footprint exists on both machines, and I have found no way to free it from either one. Python's del (on both a and z) does nothing, and ray.internal.internal_api.free doesn't work either.
Is it because I'm doing something bad with cloudpickle that breaks Ray's reference counting?