Point-in-time joins for Ray Datasets?

How do I do point-in-time joins with Ray Datasets? Is there a built-in function or an example pipeline anywhere? E.g. I want to combine multiple datasets similar to this picture:

so that the final dataset has the proper rows.

Hi @dirtyValera, can you share a bit more about what your “combine” logic looks like?

Currently, the only “combine” operation Datasets supports is concatenating two Datasets with the ds.union() API.
If you are looking for a SQL-like join, it’s not supported yet. You can find more details in the discussion at [Datasets] [Feature] Support joins/merges · Issue #18911 · ray-project/ray · GitHub.

Thanks for the answer @jianxiao!

Yes, I mean joining datasets based on timestamps (similar to pandas merge_asof, using previous values instead of NaNs). What about dask_on_ray? AFAIK Dask allows joins. Is it possible to convert existing Ray Datasets to Dask DataFrames, perform the joins, and convert the result back to a Ray Dataset?
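
To make the desired semantics concrete, here is a minimal sketch of a backward as-of join in plain pandas. The column names (ts, entity_id, label, feature) and the toy values are made up for illustration; only pd.merge_asof itself comes from the discussion.

```python
import pandas as pd

# Hypothetical events to enrich (left side of the join).
events = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-01 10:00", "2022-01-01 10:02",
        "2022-01-01 10:05", "2022-01-01 10:10",
    ]),
    "entity_id": [1, 2, 1, 1],
    "label": [0, 1, 1, 0],
})

# Hypothetical feature snapshots (right side of the join).
features = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-01 09:58", "2022-01-01 10:01", "2022-01-01 10:07",
    ]),
    "entity_id": [1, 2, 1],
    "feature": [0.5, 0.1, 0.9],
})

# merge_asof requires both frames to be sorted by the "on" key.
# direction="backward" picks, for each event, the latest feature row
# of the same entity whose timestamp is <= the event timestamp.
joined = pd.merge_asof(
    events.sort_values("ts"),
    features.sort_values("ts"),
    on="ts",
    by="entity_id",
    direction="backward",
)

print(joined)
```

Each event row picks up the most recent earlier feature value for its entity, which is the "previous values instead of NaNs" behavior I am after.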

Yes, Ray Datasets supports that:

  1. Convert Dataset to Dask DataFrame: dask_df = ds.to_dask();
  2. Convert Dask DataFrame to Dataset: ds = ray.data.from_dask(dask_df).

You can check the details in the API reference: Input/Output — Ray 3.0.0.dev0