Backdating rewards with PolicyClient

How severely does this issue affect your experience of using Ray?

  • Medium: It makes completing my task significantly more difficult, but I can work around it.

I am trying to train an agent in an external simulator hooked up to a PolicyClient. I would like to minimise the amount of data lost, because the simulator keeps running while the client is communicating with the server. My idea is to continuously collect reward data in the simulator and only pass the final reward for a step once the next action has been taken. I have read that rewards can be backdated using callbacks (possibly on_postprocess_trajectory or on_episode_end would be suitable?), but I can’t find much more information on it.
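A minimal sketch of the accumulate-then-flush idea, assuming the client exposes get_action(episode_id, observation) and log_returns(episode_id, reward) as RLlib's PolicyClient does. RewardAccumulator, add and act are hypothetical helper names, and the EpisodeClient protocol only describes the two client methods the sketch calls. Rewards are summed locally and logged just before the next action is requested, because RLlib credits logged rewards to the most recent action.

```python
from typing import Any, Optional, Protocol


class EpisodeClient(Protocol):
    """Only the part of a PolicyClient-like interface that this sketch uses."""

    def get_action(self, episode_id: str, observation: Any) -> Any: ...

    def log_returns(
        self, episode_id: str, reward: float, info: Optional[dict] = None
    ) -> None: ...


class RewardAccumulator:
    """Hypothetical helper (not part of RLlib): buffers simulator rewards
    and flushes them to the client right before the next action request."""

    def __init__(self, client: EpisodeClient, episode_id: str) -> None:
        self.client = client
        self.episode_id = episode_id
        self.pending = 0.0

    def add(self, reward: float) -> None:
        # Called by the simulator loop whenever a reward arrives, e.g. every tick.
        self.pending += reward

    def act(self, observation: Any) -> Any:
        # Log everything collected since the previous action first, so it is
        # credited to that action rather than to the one about to be requested.
        if self.pending != 0.0:
            self.client.log_returns(self.episode_id, self.pending)
        self.pending = 0.0
        return self.client.get_action(self.episode_id, observation)
```

With a real client this might be wired up as `client = PolicyClient("http://localhost:9900", inference_mode="local")`, `eid = client.start_episode()`, `acc = RewardAccumulator(client, eid)`, where the address is a placeholder. One limitation: rewards the simulator produces while act() is waiting on the server are added after the new action comes back, so they are credited to the new action. `inference_mode="local"` shortens that window but does not remove it.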

I also considered logging the returns only once the next action was taken, but according to the docs, if a second action is taken before any reward has been logged, the first action is assumed to have a reward of 0, so that won’t work.
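If rewards do end up logged one action late, one option is to shift them one step earlier after the fact. The pure function below is a hypothetical sketch of that shift on a plain list of per-step rewards; in RLlib one might apply the same logic to postprocessed_batch["rewards"] inside a callback's on_postprocess_trajectory, but that wiring is an assumption and is not shown here.

```python
from typing import List, Sequence


def backdate_rewards(rewards: Sequence[float], shift: int = 1) -> List[float]:
    """Hypothetical helper (not part of RLlib): move each reward `shift` steps earlier.

    The reward recorded at step t + shift is credited to step t. The final
    `shift` steps receive 0.0, since nothing later is available to move into them.
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    n = len(rewards)
    out = [0.0] * n
    for t in range(n):
        if t + shift < n:
            out[t] = float(rewards[t + shift])
    return out
```

For example, `backdate_rewards([0.0, 1.0, 2.0, 3.0])` gives `[1.0, 2.0, 3.0, 0.0]`. RLlib calls on_postprocess_trajectory after the policy's own postprocessing, so for algorithms that compute advantages there (for example GAE in PPO), rewards changed at that point may no longer affect training; this is worth checking before relying on the approach.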

Any ideas will be much appreciated.

Hi @theo

Usually, in a TCP environment, no data should be lost.
I have trouble understanding what your goal is.
Can you also explain what backdating means in this context?