So, when you do ray.put(df), the first copy happens (driver process memory → shared memory).
Numpy array that holds numeric data is zero-copy in Ray (I am not sure numpy array with string), and IIUC, pandas uses numpy array under the hood, so df should also be zero copyable (technically the numpy part of pandas dataframe is zero-copied). That says df2 = ray.get(o_ref) and df3 = ray.get(o_ref) doesn’t incur additional copy.
@sangcho , thanks for the answer! That makes sense to me. A copy is occurred only when data transfer between two different nodes because each one manages own object store (Plasma).
@sangcho , thanks for the answer! That makes sense to me. A copy is occurred only when data transfer between two different nodes because each one manages own object store (Plasma).
Yes that’s correct. Technically when the object is transferred to other nodes + when the object is copied to the shared memory (from ray.put) for the first time. Note that if the object doesn’t support zero-copy serialization following the pickle 5 protocol PEP 574 – Pickle protocol 5 with out-of-band data | peps.python.org the object is copied to the process memory when you call ray.get.
@sangcho , btw, what if two workers access to same data simultaneously? Will the accesses be serialized or one of the workers get a copy of the data?
Ray objects are immutable, so in this case both workers are just accessing the data in zero-copy manner. Again, note that if the object doesn’t support zero-copy serialization, they are just copied into the process memory.
o_ref = ray.put(df)
@ray.remote
def foo1(ref1):
# some kind of processing with ref1
@ray.remote
def foo2(ref2):
# some kind of processing with ref2
o_ref1 = foo1.remote(o_ref)
o_ref2 = foo2.remote(o_ref)
Will foo1 and foo2 be performed in parallel? Or won’t foo2 be started until foo1 doesn’t complete processing with the underlying data of o_ref?
Ray’s plasma objects are immutable (read-only). So, they are just performing in parallel (and there is no concern around locking and stuff). If zero-copy is not allowed, the objects are copied into the process memory, so having parallel processing on the same object has no impact