I did an interesting experiment. Start two servers with GPUs attached, and set one of them up as the head node and the other as a worker node.
Now on the worker node run the following code:
import ray
import torch

ray.init("auto")
a = torch.randn(10, 1024, 1024).cuda()  # 10 * 1024 * 1024 float32s = ~40 MB
z = ray.put(a)
print(ray.cloudpickle.dumps(z))
Now paste that output into a Python session on the head node:
a = ray.get(ray.cloudpickle.loads(OUTPUT_FROM_ABOVE))
a is now, auto-magically, a torch CUDA tensor on the head node. Bravo, ray.get!
But now the ~40 MB memory footprint exists on both machines, and I have found no way to free it from either one. Python's del (on both a and z) does nothing, and ray.internal.internal_api.free doesn't work either.
Is it because I'm doing something bad with cloudpickle that breaks Ray's reference counting?