This is for our graph-learning use case. One-line summary: we have a graph backend service that stores both the graph topology T and the node/edge features F, and we are considering moving the feature part F into the Ray object store.
For features, the key is a string (node_id or edge_id) and the value is a list of floats. This portion of the data is easy to hash-partition and store distributedly. Note that F takes far more bytes than the topology T and often dominates network bandwidth.
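A minimal sketch of the hash partitioning described above, assuming features live in a plain dict keyed by node/edge id (the function names here are ours, not part of the existing service). In the Ray version, each resulting shard would be placed into the object store, e.g. with `ray.put`:

```python
import hashlib
from typing import Dict, List

def shard_for_key(key: str, num_shards: int) -> int:
    """Map a node/edge id to a shard index via a stable hash.

    hashlib is used instead of the built-in hash() so that placement
    is deterministic across processes and restarts.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def partition_features(
    features: Dict[str, List[float]], num_shards: int
) -> List[Dict[str, List[float]]]:
    """Split the feature map F into num_shards hash partitions."""
    shards: List[Dict[str, List[float]]] = [{} for _ in range(num_shards)]
    for key, vec in features.items():
        shards[shard_for_key(key, num_shards)][key] = vec
    return shards
```

With this layout, looking up a feature only requires recomputing `shard_for_key` on the id, so no central index is needed.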
Why do this? We are taking small steps toward a fully Ray-based system, and for this step we hope Ray's object-store caching mechanism can automatically handle the data-locality optimization for us. That is, hot nodes' features get cached close to where the computation runs, while the full feature data F stays distributed. Eventually we can improve this design to be auto-pipelined using DataSet.