[Ray Dataset] Shifting data/Lag features

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all!

I am trying to understand how to create lag features on my dataframe. This is usually done with groupby + shift (e.g. group by customer id and shift, so that you build features about what the customer bought one period ago, two periods ago, etc.).
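For reference, the groupby + shift pattern described above might look like this in plain pandas. This is only a minimal sketch: the column names (customer_id, period, amount) and the data are hypothetical and do not come from the original post.

```python
import pandas as pd

# Hypothetical purchase history: one row per customer per period.
df = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "period": [1, 2, 3, 1, 2],
        "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    }
)

# Sort so that shift(1) means "one period earlier" within each customer.
df = df.sort_values(["customer_id", "period"]).reset_index(drop=True)

# Lag features: what the same customer spent 1 and 2 periods ago.
# The first row(s) of each customer have no history, so they become NaN.
df["amount_lag_1"] = df.groupby("customer_id")["amount"].shift(1)
df["amount_lag_2"] = df.groupby("customer_id")["amount"].shift(2)
```

The shift is applied per customer, so the first row of customer 2 does not pick up the last value of customer 1.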

Is there any way to do this in Ray Dataset? I don’t see shift in the groupby methods, and it does not look like it could be done with map or map_batches, since shift would need to access the first/last item of a different batch.

Is there any way to shift data on a Ray Dataset? Or am I forced to convert to Dask (and make a copy of the data, if my understanding is correct)?


I just discovered that Ray Datasets have a map_groups function, so I’m now assuming you can achieve this by grouping and mapping each group to a pandas shift function. I will give it a try and report back.


Hi @Andrea_Pisoni - thanks for the question. Yes, you can use map_groups to keep the first and last item per group, and then run map_batches on the grouped data. Let us know how it works. Thanks. We don’t support shift natively at the moment.

Hi Chengsu,

Thanks so much for the suggestion. Can you expand on what you mean?

I thought I would use the pandas.shift function in map_groups directly. How would you use map_batches instead? Each batch is not guaranteed to be a group, right? How would that work if, for example, a group is split across three batches?
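The concern about a group being split across batches can be illustrated with plain pandas. This is a sketch with hypothetical data, where the two slices stand in for batches; it does not show how Ray actually partitions rows.

```python
import pandas as pd

# One hypothetical customer whose rows happen to land in two separate batches.
group = pd.DataFrame(
    {"customer_id": [1, 1, 1, 1], "amount": [10.0, 20.0, 30.0, 40.0]}
)
batches = [group.iloc[:2], group.iloc[2:]]

# Shifting each batch independently: the first row of the second batch
# cannot see the last row of the first batch, so its lag is lost.
per_batch = pd.concat([batch["amount"].shift(1) for batch in batches])

# Shifting the whole group at once keeps the lag across that boundary.
per_group = group["amount"].shift(1)
```

Here per_batch has a missing value at the batch boundary that per_group does not have, which is why a per-batch shift is not enough on its own.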

Hi @Andrea_Pisoni - actually, after thinking about it again, I think using map_groups with pandas.shift should work. I think you would need to keep the previous group in memory, so that for each group you know how to shift across groups, right?
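A per-group function of the kind discussed in this thread might look like the sketch below. The function name add_lags and the column names are hypothetical; the function itself is plain pandas and is only checked locally here, not against a Ray cluster.

```python
import pandas as pd


def add_lags(group: pd.DataFrame) -> pd.DataFrame:
    """Add lag columns to the rows of a single customer (hypothetical schema)."""
    # Order rows in time so shift(n) means "n periods earlier".
    group = group.sort_values("period").copy()
    for lag in (1, 2):
        group[f"amount_lag_{lag}"] = group["amount"].shift(lag)
    return group


# Local check on one group with rows deliberately out of order.
sample = pd.DataFrame(
    {"customer_id": [7, 7, 7], "period": [3, 1, 2], "amount": [30.0, 10.0, 20.0]}
)
result = add_lags(sample)
```

With Ray Data, a function like this could presumably be applied as ds.groupby("customer_id").map_groups(add_lags, batch_format="pandas"), assuming ds is a Dataset with these columns. Since map_groups passes all rows of one group to each call, lags that only look back within the same customer should not need any state from other groups.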