Plasma store APIs

sanjaysrikakulam · March 16, 2022, 11:15am

Hi,

After a long break, I am moving back to using Ray and the Plasma store. I have just found out that Plasma is not part of Arrow and that Ray has its own fork.

I was wondering whether you have any plans to make Plasma available as a stand-alone package with the default APIs (The Plasma In-Memory Object Store — Apache Arrow v7.0.0). If not as a stand-alone package, could Ray at least expose Plasma with some of its default APIs, so that users can access it simply by importing from ray import plasma?

Sorry, I could not find any thread with this information, so…

ericl · March 16, 2022, 10:12pm

Some context: we found it necessary to fork Plasma in order to support certain features in Ray, such as object spilling and distributed reference counting. These features require deep integration with the underlying memory store, which was hard to maintain while Plasma was an external dependency.

Consequently, it doesn’t make sense to expose the “raw” Plasma API, since using it directly would bypass the Ray object layer. It might make sense to expose more of it through the Ray API, though. Is there specific functionality you need that is not provided by ray.put/get etc.?

sanjaysrikakulam · March 18, 2022, 10:56am

Hi @ericl

Thanks for the information.

I use the Plasma store as a backend for a key-value store. With the help of the existing APIs, I create the object IDs (hash the key to a 20-byte digest and convert it to an object ID) and then store the values in the store using put. Depending on the situation, I instead create and seal the buffer myself, along with additional metadata. I use other APIs as well: my key-value store relies on the Plasma API to find the number of entries in the Plasma store, the amount of memory used, the amount of memory still available, etc.
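As a rough illustration of the key-to-ID step described above, here is a minimal sketch. It assumes SHA-1 as the hash, chosen only because its digest happens to be exactly 20 bytes; the post does not say which hash is actually used, and the function name is hypothetical.

```python
import hashlib

OBJECT_ID_SIZE = 20  # Plasma object IDs are 20 bytes long


def key_to_object_id_bytes(key: str) -> bytes:
    """Hash an arbitrary string key to a fixed 20-byte digest.

    Assumption: SHA-1 is used purely for its 20-byte digest size,
    not for any security property.
    """
    return hashlib.sha1(key.encode("utf-8")).digest()


digest = key_to_object_id_bytes("user:42")
assert len(digest) == OBJECT_ID_SIZE
```

The resulting bytes are what would then be wrapped in an object ID before calling put or create/seal. The mapping is deterministic, so the same key always resolves to the same ID.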

So it would be really great if Ray could support these features. (I am sure Ray’s Plasma store already has them; they just need to be exposed, at least via the Ray API.)

ericl · March 18, 2022, 6:06pm

One approach is to use a Ray actor and ray.put() to store data in the store:

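import ray
from ray import ObjectRef

@ray.remote
class KVActor:
    def __init__(self):
        # Maps each key to the ObjectRef of its value in the object store.
        self.data = {}

    def put(self, key, value):
        # ray.put() stores the value in the object store and returns a reference.
        self.data[key] = ray.put(value)

    def get(self, key) -> ObjectRef:
        return self.data[key]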

This has the advantages of Ray (globally accessible in cluster, reference counting), without needing to directly access plasma.

You can also do stuff like bounding the memory usage by deleting old entries on put(), etc. in the KVActor. This can also be scaled out to multiple nodes by launching multiple actor instances.
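The "delete old entries on put()" idea could look roughly like the sketch below. It is written as a plain Python class so it runs without a cluster. The class name, the max_entries parameter, and the store hook are all hypothetical. Under Ray, the class would presumably be decorated with @ray.remote and store would be ray.put, so that dropping an entry releases the actor's reference to the object.

```python
from collections import OrderedDict
from typing import Any, Callable


class BoundedKV:
    """Key-value map that evicts its oldest entries once it grows too large."""

    def __init__(self, max_entries: int, store: Callable[[Any], Any] = lambda v: v):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        # Hypothetical hook: pass ray.put here to keep ObjectRefs instead of values.
        self.store = store
        self.data: "OrderedDict[Any, Any]" = OrderedDict()

    def put(self, key, value):
        # Re-inserting an existing key moves it to the "newest" end.
        if key in self.data:
            del self.data[key]
        self.data[key] = self.store(value)
        # Evict from the "oldest" end until we are back under the bound.
        while len(self.data) > self.max_entries:
            self.data.popitem(last=False)

    def get(self, key):
        return self.data[key]


kv = BoundedKV(max_entries=2)
kv.put("a", 1)
kv.put("b", 2)
kv.put("c", 3)  # evicts "a", the oldest entry
assert "a" not in kv.data
```

This bounds the number of entries rather than bytes. Bounding by memory would need a size estimate per value, which is left out here.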

sanjaysrikakulam · March 21, 2022, 8:41am

Cool, thanks for suggesting Ray actors. I will look into them and see whether they meet my requirements. Also, does Ray now support out-of-band (zero-copy) serialization for normal Python objects like dicts and lists, or does the zero-copy path work only for certain NumPy objects?

bybyte · December 16, 2022, 6:20pm

I am in the same boat as @sanjaysrikakulam. I find the store invaluable as a standalone tool, and now that it is deprecated in Arrow, it would be great if it were supported by Ray. I understand that your fork is integrated with Ray and provides features such as object spilling and distributed reference counting, but might it be worth considering its value as a standalone tool too? It’s a relevant tool for shared-memory usage, an area where tools are generally lacking.
