I have a dataset with two columns: a timestamp and a value. The values in the timestamp column are unique.
I want to transform it to a dataset with each row:
timestamp, value, lag(value, 1), lag(value, 2), lag(value, 3), ...
where lag(value, n) means the value column nth timestamps previous.
So (first column is actually a timestamp type, but… easier to write just numbers):
1,10
2,34
4,22
7,66
13,10
17,29
So, if window size was 4, then what I would like is:
7, 66, 22, 34, 10
13, 10, 66, 22, 34
17, 29, 10, 66, 22
It’s a large dataset, so I’d like to do it in a distributed (possibly lazy) way. How might that best be done?