How severely does this issue affect your experience of using Ray?
Medium: It makes completing my task significantly more difficult, but I can work around it.
Hi, I am using your GTrXL implementation and noticed that during inference the input has shape [batch_size, 1 (seq_len), feature_size].
As far as I know, the whole point of attention networks is that they attend over a sequence longer than 1, so the model can evaluate and "look at" the whole sequence at once.
Why is it implemented this way? If each timestep were one day, shouldn't we feed several days together rather than one day at a time?
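To make the question concrete, here is a minimal sketch of the two input shapes I am comparing. It is plain Python with dummy data, not RLlib code; `feature_size`, `days`, and the `shape` helper are hypothetical names I made up for illustration.

```python
# Illustration only: dummy data and hypothetical names, no RLlib calls.

def shape(batch):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(batch, list):
        dims.append(len(batch))
        batch = batch[0]
    return tuple(dims)

feature_size = 4
# Ten "days" of dummy features, one feature vector per day.
days = [[float(d)] * feature_size for d in range(10)]

# What I observe at inference: one day per forward pass.
single_step = [[days[-1]]]   # [batch_size=1, seq_len=1, feature_size]

# What I expected: the last few days fed together.
multi_step = [days[-5:]]     # [batch_size=1, seq_len=5, feature_size]

print(shape(single_step))  # (1, 1, 4)
print(shape(multi_step))   # (1, 5, 4)
```

The first shape is what I see being passed to the model; the second is what I assumed an attention model would need.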
Hope to hear from you soon!