How many copies are occurred when getting an object from Plasma

Hi guys,

I would like to find out how many data copies are occurred when getting an object from Plasma. An example:

import pandas
import ray
ray.init()

df = pandas.DataFrame([1, 2, 3])
o_ref = ray.put(df)

df2 = ray.get(o_ref)

@ray.remote
def foo():
    df3 = ray.get(o_ref)
    ...

foo.remote()

Would df2 be a copy of df at the driver process and df3 at the worker process (2 copies)?

Thanks in advance!

cc @sangcho , any thoughts?

I think there should be only one copy when you do ray.put.

Would df2 be a copy of df at the driver process

https://docs.ray.io/en/master/serialization.html#numpy-arrays

So, when you do ray.put(df), the first copy happens (driver process memory → shared memory).

Numpy array that holds numeric data is zero-copy in Ray (I am not sure numpy array with string), and IIUC, pandas uses numpy array under the hood, so df should also be zero copyable (technically the numpy part of pandas dataframe is zero-copied). That says df2 = ray.get(o_ref) and df3 = ray.get(o_ref) doesn’t incur additional copy.

@sangcho , thanks for the answer! That makes sense to me. A copy is occurred only when data transfer between two different nodes because each one manages own object store (Plasma).

@sangcho , btw, what if two workers access to same data simultaneously? Will the accesses be serialized or one of the workers get a copy of the data?

@sangcho , thanks for the answer! That makes sense to me. A copy is occurred only when data transfer between two different nodes because each one manages own object store (Plasma).

Yes that’s correct. Technically when the object is transferred to other nodes + when the object is copied to the shared memory (from ray.put) for the first time. Note that if the object doesn’t support zero-copy serialization following the pickle 5 protocol PEP 574 -- Pickle protocol 5 with out-of-band data | Python.org the object is copied to the process memory when you call ray.get.

@sangcho , btw, what if two workers access to same data simultaneously? Will the accesses be serialized or one of the workers get a copy of the data?

Ray objects are immutable, so in this case both workers are just accessing the data in zero-copy manner. Again, note that if the object doesn’t support zero-copy serialization, they are just copied into the process memory.

Let’s say we have the following example.

o_ref = ray.put(df)

@ray.remote
def foo1(ref1):
    # some kind of processing with ref1

@ray.remote
def foo2(ref2):
    # some kind of processing with ref2

o_ref1 = foo1.remote(o_ref)
o_ref2 = foo2.remote(o_ref)

Will foo1 and foo2 be performed in parallel? Or won’t foo2 be started until foo1 doesn’t complete processing with the underlying data of o_ref?

@sangcho , a friendly reminder

Ray’s plasma objects are immutable (read-only). So, they are just performing in parallel (and there is no concern around locking and stuff). If zero-copy is not allowed, the objects are copied into the process memory, so having parallel processing on the same object has no impact

@sangcho , I see, thanks!