Use distributed object store for storing key-value pairs?

I understand that ray.get()/ray.put() work with object references as keys. Now I have a large (~1 TB) key-value dataset that I would like to store in a distributed fashion; how can I make use of the object store for this purpose? (In this application, both keys and values are small strings.)
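For context, a toy example of the pattern I have in mind; the kv_put/kv_get calls at the end are just what I wish existed, not a real Ray API:

```python
import ray

ray.init()

# ray.put() returns an ObjectRef, and ray.get() dereferences it.
# The "key" is always the ObjectRef itself; there is no built-in way
# to look a value up by an arbitrary string key.
ref = ray.put("some_value")
print(ray.get(ref))  # "some_value"

# What I'd like, conceptually (hypothetical, not a real Ray API):
# kv_put("node_123", "some_value")
# value = kv_get("node_123")
```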

One alternative, of course, is to use a dedicated distributed KV store, but I wonder whether it makes sense to keep everything inside Ray.


@HuangLED Can you tell me more about your use case?

cc @ericl @Alex

Thanks Richard!

This is for our graph learning application. One-line summary: we have a graph backend service that contains both the graph topology T and the node/edge features F, and we are now thinking of moving the feature part F into the Ray object store.

For the features, the key is a string (node_id or edge_id) and the value is a list of floats. This portion of the data is easy to hash and store in a distributed fashion. Note that F takes far more bytes than the topology T and often dominates network bandwidth.

Why do this? We are taking small steps toward becoming completely Ray-based. For this step, we are hoping that Ray's object store caching mechanism can automatically take care of data locality optimization for us: hot nodes' features get cached close to where the computation runs, while the whole feature dataset F stays distributed. Eventually we can improve this design to be auto-pipelined using Dataset.

Is there any other information I can provide for the use case? :grinning:

If there is a natural partitioning of the data, you can consider storing the data as a series of chunks and retrieving the right chunk per lookup via ray.get(); that would give you locality of data transfers. Unfortunately, Ray doesn't have higher-level abstractions for this type of use case right now.
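A minimal sketch of that chunking approach, assuming a fixed chunk count and a stable hash; load_kv_pairs() is a hypothetical loader for your data, not a Ray API:

```python
import hashlib
import ray

ray.init()

NUM_CHUNKS = 1024  # tune: larger chunks amortize transfer, smaller chunks fetch less

def chunk_index(key: str) -> int:
    # Stable hash (unlike Python's built-in hash(), which is randomized
    # per process), so every worker agrees on the key -> chunk mapping.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_CHUNKS

# Partition the key-value pairs into chunks, then put each chunk
# into the object store once.
chunks = [dict() for _ in range(NUM_CHUNKS)]
for key, value in load_kv_pairs():  # hypothetical loader for your data
    chunks[chunk_index(key)][key] = value
chunk_refs = [ray.put(chunk) for chunk in chunks]

def lookup(key: str):
    # ray.get() fetches (and locally caches) only the one chunk
    # that can contain this key.
    return ray.get(chunk_refs[chunk_index(key)]).get(key)
```

In practice you would likely build the chunks inside Ray tasks rather than on the driver, but the lookup path stays the same: because the chunk index is computed from the key, no lookup table needs to be shipped around.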


Got it. Given this, we will definitely need to create a (vertex → chunk) mapping, then.

We will need to experiment with how big each chunk should be. One follow-up question: if we create a hash map for this purpose, is there a performance concern when the hash map gets too large? (Say 2B vertices; even partitioned into chunks, the key space could still be large.) The worst case is that we end up with one copy of the hash map on each machine. But I wonder whether Ray is somehow already smart enough to store the hash map(s) distributedly? :grinning:
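A back-of-envelope sketch of why one copy per machine worries me; the per-entry overhead here is a rough guess for a CPython dict with short string keys, not a measured number:

```python
# Rough size estimate for an explicit vertex -> chunk hash map
# (illustrative numbers only):
num_vertices = 2_000_000_000
bytes_per_entry = 100  # rough guess: short str key + int value + dict overhead
total_gb = num_vertices * bytes_per_entry / 1e9
print(f"~{total_gb:.0f} GB per copy of the map")  # ~200 GB
```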