Use distributed object store for storing key-value pairs?

I understand that ray.get()/ray.put() work with object references as keys. Now I have a large (~1 TB) key-value dataset that I would like to store in a distributed fashion; how can I make use of the object store for this purpose? (In this application, both keys and values are small strings.)
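For context, a toy example of the pattern I have in mind; the kv_put/kv_get calls at the end are just what I wish existed, not a real Ray API:

```python
import ray

ray.init()

# ray.put() returns an ObjectRef, and ray.get() dereferences it.
# The "key" is always the ObjectRef itself; there is no built-in way
# to look a value up by an arbitrary string key.
ref = ray.put("some_value")
print(ray.get(ref))  # "some_value"

# What I'd like, conceptually (hypothetical, not a real Ray API):
# kv_put("node_123", "some_value")
# value = kv_get("node_123")
```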

One alternative, of course, is to use a dedicated distributed KV store, but I wonder whether it makes sense to keep everything inside Ray.


@HuangLED Can you tell me more about your use case?

cc @ericl @Alex

Thanks Richard!

This is for our graph learning application. One-line summary: we have a graph backend service that contains both the graph topology T and the node/edge features F, and we are now thinking of moving the feature part F into the Ray object store.

For the features, the key is a string (node_id or edge_id) and the value is a list of floats. This portion of the data is easy to hash and store in a distributed fashion. Note that F takes far more bytes than the topology T and often dominates network bandwidth.

Why do this? We are taking small steps toward becoming completely Ray-based. For this step, we are hoping that Ray's object store caching mechanism can automatically take care of data locality optimization for us: hot nodes' features get cached close to where the computation runs, while the whole feature dataset F stays distributed. Eventually we can improve this design to be auto-pipelined using Dataset.

Is there any other information I can provide for the use case? :grinning:

If there is a natural partitioning of the data, you can consider storing the data as a series of chunks and retrieving the right chunk per lookup via ray.get(); that would give you locality of data transfers. Unfortunately, Ray doesn't have higher-level abstractions for this type of use case right now.
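A minimal sketch of that chunking approach, assuming a fixed chunk count and a stable hash; load_kv_pairs() is a hypothetical loader for your data, not a Ray API:

```python
import hashlib
import ray

ray.init()

NUM_CHUNKS = 1024  # tune: larger chunks amortize transfer, smaller chunks fetch less

def chunk_index(key: str) -> int:
    # Stable hash (unlike Python's built-in hash(), which is randomized
    # per process), so every worker agrees on the key -> chunk mapping.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_CHUNKS

# Partition the key-value pairs into chunks, then put each chunk
# into the object store once.
chunks = [dict() for _ in range(NUM_CHUNKS)]
for key, value in load_kv_pairs():  # hypothetical loader for your data
    chunks[chunk_index(key)][key] = value
chunk_refs = [ray.put(chunk) for chunk in chunks]

def lookup(key: str):
    # ray.get() fetches (and locally caches) only the one chunk
    # that can contain this key.
    return ray.get(chunk_refs[chunk_index(key)]).get(key)
```

In practice you would likely build the chunks inside Ray tasks rather than on the driver, but the lookup path stays the same: because the chunk index is computed from the key, no lookup table needs to be shipped around.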


Got it. Given this, we will definitely need to create a (vertex → chunk) mapping, then.

We will need to experiment with how big each chunk should be. One follow-up question: if we create a hash map for this purpose, is there a performance concern when the hash map gets too large? (Say 2B vertices; even partitioned into chunks, the key space could still be large.) The worst case is that we end up with one copy of the hash map on each machine. But I wonder whether Ray is somehow already smart enough to store the hash map(s) distributedly? :grinning:
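A back-of-envelope sketch of why one copy per machine worries me; the per-entry overhead here is a rough guess for a CPython dict with short string keys, not a measured number:

```python
# Rough size estimate for an explicit vertex -> chunk hash map
# (illustrative numbers only):
num_vertices = 2_000_000_000
bytes_per_entry = 100  # rough guess: short str key + int value + dict overhead
total_gb = num_vertices * bytes_per_entry / 1e9
print(f"~{total_gb:.0f} GB per copy of the map")  # ~200 GB
```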