How many copies occur when getting an object from Plasma

Yaroslav_Igoshev · October 14, 2021, 12:14pm

Hi guys,

I would like to find out how many data copies occur when getting an object from Plasma. An example:

import pandas
import ray
ray.init()

df = pandas.DataFrame([1, 2, 3])
o_ref = ray.put(df)

df2 = ray.get(o_ref)

@ray.remote
def foo():
    df3 = ray.get(o_ref)
    ...

foo.remote()

Would df2 be a copy of df at the driver process and df3 at the worker process (2 copies)?

Thanks in advance!

Yaroslav_Igoshev · October 18, 2021, 8:55am

cc @sangcho , any thoughts?

sangcho · October 18, 2021, 11:50pm

I think there should be only one copy when you do ray.put.

Would df2 be a copy of df at the driver process

https://docs.ray.io/en/master/serialization.html#numpy-arrays

So, when you do ray.put(df), the first copy happens (driver process memory → shared memory).

A NumPy array that holds numeric data is zero-copy in Ray (I am not sure about NumPy arrays of strings), and IIUC pandas uses NumPy arrays under the hood, so df should also be zero-copyable (technically, the NumPy part of the pandas DataFrame is zero-copied). That means df2 = ray.get(o_ref) and df3 = ray.get(o_ref) don’t incur an additional copy.

YarShev · October 20, 2021, 11:18am

@sangcho , thanks for the answer! That makes sense to me. A copy occurs only when data is transferred between two different nodes, because each node manages its own object store (Plasma).

YarShev · October 20, 2021, 12:45pm

@sangcho , btw, what if two workers access the same data simultaneously? Will the accesses be serialized, or will one of the workers get a copy of the data?

sangcho · October 20, 2021, 12:58pm

@sangcho , thanks for the answer! That makes sense to me. A copy occurs only when data is transferred between two different nodes, because each node manages its own object store (Plasma).

Yes, that’s correct. Technically, a copy happens when the object is transferred to another node and when the object is first copied to shared memory (by ray.put). Note that if the object doesn’t support zero-copy serialization following pickle protocol 5 (PEP 574, Pickle protocol 5 with out-of-band data), it is copied into the process memory when you call ray.get.

@sangcho , btw, what if two workers access the same data simultaneously? Will the accesses be serialized, or will one of the workers get a copy of the data?

Ray objects are immutable, so in this case both workers just access the data in a zero-copy manner. Again, note that if the object doesn’t support zero-copy serialization, it is copied into each process’s memory.
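
To make the pickle protocol 5 point concrete, here is a minimal sketch using only the standard library, adapted from the example in the Python pickle documentation. It does not use Ray and is not Ray's implementation; the class ZeroCopyByteArray is a hypothetical stand-in for an object that supports out-of-band buffers. It only illustrates the difference between in-band pickling (data is copied) and out-of-band pickling (the buffer is handed over as-is).

```python
import pickle
from pickle import PickleBuffer


class ZeroCopyByteArray(bytearray):
    """Hypothetical stand-in for an object that supports out-of-band buffers."""

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Expose the underlying memory as a buffer instead of copying it.
            return type(self)._reconstruct, (PickleBuffer(self),), None
        # Older protocols have no out-of-band support, so copy the data.
        return type(self)._reconstruct, (bytearray(self),)

    @classmethod
    def _reconstruct(cls, obj):
        with memoryview(obj) as m:
            owner = m.obj  # the object that owns the underlying memory
            if type(owner) is cls:
                return owner  # same memory, no copy
            return cls(owner)  # in-band data: build a new object (copy)


original = ZeroCopyByteArray(b"abc")

# In-band: the payload is embedded in the pickle stream, so loading copies it.
copied = pickle.loads(pickle.dumps(original, protocol=5))

# Out-of-band: the buffer is passed separately and reused on load.
buffers = []
data = pickle.dumps(original, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(data, buffers=buffers)

print(copied is original)    # False: a copy was made
print(restored is original)  # True: no copy was made
```

Objects that cannot expose their data as a buffer this way always take the in-band path, which matches the case above where ray.get copies the object into process memory.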

YarShev · October 20, 2021, 7:37pm

Let’s say we have the following example.

o_ref = ray.put(df)

@ray.remote
def foo1(ref1):
    # Ray resolves the ObjectRef argument, so ref1 is the DataFrame itself.
    # some kind of processing with ref1
    pass

@ray.remote
def foo2(ref2):
    # some kind of processing with ref2
    pass

o_ref1 = foo1.remote(o_ref)
o_ref2 = foo2.remote(o_ref)

Will foo1 and foo2 run in parallel? Or will foo2 not start until foo1 finishes processing the underlying data of o_ref?

Yaroslav_Igoshev · October 27, 2021, 7:22am

@sangcho , a friendly reminder

sangcho · October 27, 2021, 7:46am

Ray’s Plasma objects are immutable (read-only), so the tasks simply run in parallel (and there is no concern about locking and such). If zero-copy is not possible, the objects are copied into each process’s memory, so processing the same object in parallel has no impact.
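
As a rough illustration of why read-only access needs no locking, the sketch below uses plain Python memoryview objects rather than Ray or Plasma. The bytearray named store and the helper make_readers are assumptions made up for this example; they only stand in for an object living in shared memory and for workers that map it.

```python
def make_readers(buffer, count):
    """Return `count` read-only views over the same buffer (no data is copied)."""
    return [memoryview(buffer).toreadonly() for _ in range(count)]


# Stand-in for an immutable object held in a shared object store.
store = bytearray(b"shared object payload")

reader_a, reader_b = make_readers(store, 2)

# Both readers point at the same underlying memory.
same_memory = reader_a.obj is store and reader_b.obj is store

# Writes through a read-only view are rejected, so readers cannot interfere.
try:
    reader_a[0] = 0
    write_rejected = False
except TypeError:
    write_rejected = True

print(same_memory, write_rejected)
```

Since neither reader can modify the data, both can read it at the same time without coordination, which is the property the answer above relies on.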

YarShev · November 2, 2021, 7:44am

@sangcho , I see, thanks!
