Plasma store APIs

Hi,

After a long break, I am moving back to using Ray and the Plasma store. I have now found out that Plasma is not part of Arrow and that Ray has its own fork.

I was wondering whether you have any plans to make Plasma available as a stand-alone package with the default APIs (The Plasma In-Memory Object Store — Apache Arrow v7.0.0). If not as a stand-alone package, could Ray at least expose Plasma with some of its default APIs, so that users can access it simply via from ray import plasma?

Sorry, I could not find any thread with this information, so…

Some context: We found it necessary to fork Plasma in order to support certain features in Ray, such as object spilling and distributed reference counting. These features require deep integration with the underlying memory store, which was hard to maintain while Plasma was an external dependency.

Consequently, it doesn’t make sense to expose the “raw” Plasma API, since using it directly would bypass the Ray object layer. It might make sense to expose more of it through the Ray API, though. Is there specific functionality you need that is not provided by ray.put/get, etc.?

Hi @ericl

Thanks for the information.

I use the Plasma store as the backend for a key-value store. Using the existing APIs, I create the object IDs (I hash each key to a 20-byte digest and convert it to an object ID) and then store the values with put. Depending on the situation, I instead create and seal a buffer along with additional metadata, and I use other APIs as well. My key-value store relies on the Plasma API wherever possible, for example to find the number of entries in the Plasma store, the amount of memory used, and the amount of memory still available.
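For anyone following along, here is a minimal sketch of the key-hashing step described above. The post does not say which hash function is used; SHA-1 is an assumption, picked only because its digest happens to be exactly 20 bytes. The helper name key_to_id_bytes is hypothetical, and the final step of wrapping the bytes in a store-specific object ID is left out on purpose.

```python
import hashlib

OBJECT_ID_SIZE = 20  # bytes; the digest length mentioned in the post


def key_to_id_bytes(key: str) -> bytes:
    """Hash an arbitrary string key to a fixed-size 20-byte digest.

    Assumption: SHA-1 is used only because its output is 20 bytes long.
    Any hash that can be truncated or configured to 20 bytes would also work.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    if len(digest) != OBJECT_ID_SIZE:
        raise ValueError("unexpected digest length")
    return digest


if __name__ == "__main__":
    # The same key always maps to the same 20 bytes.
    print(key_to_id_bytes("user:42").hex())
```

The mapping is deterministic, so any client that knows the key can recompute the ID without a lookup table.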

So it would be really great if Ray could support these features. (I am sure Ray’s Plasma store already has them; they just need to be exposed, at least via the Ray API.)

One approach is to use a Ray actor and ray.put() to store data in the store:

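import ray

@ray.remote
class KVActor:
    def __init__(self):
        # Maps each key to the ObjectRef of its stored value.
        self.data = {}

    def put(self, key, value):
        # ray.put() places the value in the object store and returns a reference.
        self.data[key] = ray.put(value)

    def get(self, key) -> ray.ObjectRef:
        return self.data[key]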

This gives you the advantages of Ray (objects are globally accessible in the cluster and reference counted) without needing to access Plasma directly.

You can also do things like bound the memory usage by deleting old entries on put() in the KVActor. This can be scaled out to multiple nodes by launching multiple actor instances.
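To make the last two ideas a bit more concrete, here is a rough sketch in plain Python, not a definitive implementation. BoundedKV, put_fn, and shard_for_key are hypothetical names that do not come from Ray. Bounding by entry count (rather than by bytes) and routing keys by CRC32 are both assumptions chosen to keep the example short. In a real setup the class would presumably be decorated with @ray.remote and put_fn would be ray.put, so that dropping an entry releases the actor's reference to the stored object.

```python
import zlib
from collections import OrderedDict
from typing import Any, Callable


class BoundedKV:
    """Key-value map that evicts the oldest entries once it is full."""

    def __init__(self, max_entries: int,
                 put_fn: Callable[[Any], Any] = lambda v: v):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        # put_fn stands in for ray.put (assumption); the default stores
        # the value itself so the sketch runs without Ray installed.
        self.put_fn = put_fn
        self.data: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, key: str, value: Any) -> None:
        if key in self.data:
            # Remove first so the key is re-inserted as the newest entry.
            del self.data[key]
        self.data[key] = self.put_fn(value)
        # Evict in insertion order until we are back under the bound.
        while len(self.data) > self.max_entries:
            self.data.popitem(last=False)

    def get(self, key: str) -> Any:
        return self.data[key]


def shard_for_key(key: str, num_shards: int) -> int:
    """Pick which of num_shards store instances should own this key."""
    # CRC32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_shards


if __name__ == "__main__":
    kv = BoundedKV(max_entries=2)
    for k, v in [("a", 1), ("b", 2), ("c", 3)]:
        kv.put(k, v)
    print(list(kv.data))  # "a" was the oldest entry and has been evicted
    print(shard_for_key("a", 4))
```

With several actor instances, a client would call shard_for_key first and then send the put or get to the actor at that index.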

Cool, thanks for suggesting Ray actors. I will look into them and see whether they meet my requirements. Also, does Ray now support out-of-band (zero-copy) serialization for normal Python objects like dicts and lists, or does the zero-copy path work only for certain NumPy objects?