After reading the documentation, I found that if we ray.put a numpy array and then ray.get it, the array is not replicated: all references point to the same memory address. But if we ray.put a non-numpy object, the object is still replicated when we ray.get it.
Am I misunderstanding something?
If not, is there any way to prevent objects from being replicated, i.e., to share memory across multiple processes?
Thanks in advance!
You are right. Ray supports zero-copy reads for numpy arrays (of float or integer dtypes). Other data structures are copied when workers ray.get them.
Also note that many libraries use numpy arrays internally. For example, pandas is one of them, so if you use pandas, most of your data is read zero-copy.
Thanks a lot. I’ll try what you’ve put forward
Are there any strategies for dealing with containers of many strings (lists of strings, dicts of strings, a pandas string Index, etc.)?
I have found that transporting around large objects of strings can be a challenge.
Any possibility of using pyarrow or other strategies to optimize getting non-numerical data from the object store?
Good question. I don’t have much context here regarding our plan around this, but @suquark probably has some thoughts on this.
This is something that bugs me as well: why not at least support explicit memory sharing without any copying for some byte-based buffer structure?
I do not know if Ray creates a shared_memory.SharedMemory for a numpy array internally, but why not explicitly and directly support exchanging references to a SharedMemory buffer? The client could put anything in it as needed, including the byte representation of strings and other data. Requiring a numpy array instead seems unnecessarily specific about the kind of data to exchange via shared memory.
Or is there some other proposed pattern for how to share large-memory (immutable) data structures between processes without the effort of encoding/decoding and copying?
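For what it's worth, the pattern I have in mind can be sketched with the standard library's shared_memory module (buffer size and names here are illustrative), though it leaves lifetime management and serialization entirely to the user:

```python
import numpy as np
from multiprocessing import shared_memory

# Create a raw shared buffer and view it as a numpy array.
shm = shared_memory.SharedMemory(create=True, size=8 * 5)
src = np.ndarray((5,), dtype=np.int64, buffer=shm.buf)
src[:] = [1, 2, 3, 4, 5]

# Another process would attach by name; here we reattach in-process.
shm2 = shared_memory.SharedMemory(name=shm.name)
dst = np.ndarray((5,), dtype=np.int64, buffer=shm2.buf)
result = dst.tolist()  # reads the same underlying memory, no copy

# Clean up: detach both handles, then free the segment.
del dst
shm2.close()
del src
shm.close()
shm.unlink()
```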
why not at least support explicit memory sharing without any copying for some byte-based buffer structure?
Ray puts "serialized" objects into shared memory (it doesn't use Python's shared memory module; we have our own built-in shared-memory-based distributed object store, called the plasma store, which uses mmap). That said, Python built-in types are hard to use with shared memory, since doing so would bypass the Python memory management layer; if we wanted to support this, we would have to wrap the original Python data types in our own shared-memory-compatible types. The reason numpy is supported by zero-copy serialization is that we use the pickle 5 protocol, which supports zero-copy serialization for objects that provide the proper buffer interface (e.g., numpy): PEP 574 – Pickle protocol 5 with out-of-band data | peps.python.org. I believe the same applies to Python's shared memory module: you have to use its own ShareableList type, which has many limitations.
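As a concrete illustration of the PEP 574 mechanism (plain stdlib pickle here, not Ray's serializer), the out-of-band buffers let the consumer reuse the producer's memory instead of copying it:

```python
import pickle
import numpy as np

arr = np.arange(10, dtype=np.int64)

# With protocol 5, numpy hands its buffer out-of-band instead of
# embedding a copy of the data in the pickle byte stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# Loading with the same buffers reconstructs an array that shares
# memory with the original; only metadata went through the stream.
restored = pickle.loads(payload, buffers=buffers)
assert np.shares_memory(arr, restored)
```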
why not explicitly and directly support exchanging references to a SharedMemory buffer
This would bypass our abstraction barrier and make it hard for Ray to keep track of memory usage, which is used for scheduling, memory management, etc.
Or is there some other proposed pattern for how to share large-memory (immutable) data structures between processes without the effort of encoding/decoding and copying?
I think one possible solution is to use an actor as a data holder and fetch only the parts of the data you need through that actor.