Backdating rewards with PolicyClient

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I am trying to train an agent in an external simulator hooked up to a PolicyClient. I would like to minimise the amount of data lost, since the simulator keeps running while the client is communicating with the server. My thought is to continuously collect reward data in the simulator and only pass the final reward for a step once the next action has been taken. I have read that you can backdate the rewards using callbacks (possibly on_postprocess_trajectory or on_episode_end would be suitable?), but I can't find much more information on it.
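
Roughly, the client-side loop I have in mind looks like this (a simplified sketch; the server address and the simulator interface are placeholders, and the `PolicyClient` calls follow RLlib's external-environment client API):

```python
from ray.rllib.env.policy_client import PolicyClient

sim = MyExternalSimulator()  # placeholder for the real-time simulator

client = PolicyClient("http://localhost:9900", inference_mode="remote")

episode_id = client.start_episode(training_enabled=True)
obs = sim.reset()
done = False
while not done:
    action = client.get_action(episode_id, obs)
    # The simulator keeps running while the action is computed and applied,
    # so the reward for this step is only final just before the next action.
    obs, reward, done = sim.apply_action(action)
    client.log_returns(episode_id, reward)
client.end_episode(episode_id, obs)
```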

I also considered only logging the returns once the next action was taken, but according to the docs, if a second action is requested before any reward is logged, the first action is assumed to have a reward of 0, so that won't work.

Any ideas will be much appreciated.

Hi @theo

Usually, in a TCP environment, no data should be lost.
I have trouble understanding what your goal is.
Can you also explain what backdating means in this context?

Cheers

Hi @arturn,

I apologise for the late reply; I hadn't seen this until now. I managed to solve the issue using the on_postprocess_trajectory callback. For the sake of anyone else who stumbles across this, here is the problem I was trying to solve.

The simulator runs in real time and doesn't wait for latency caused by the network, inference from the algorithm, etc. This means that some packets of data from the simulator are never examined, so we lose some information about the result of an action (see the image below).

The solution is to keep monitoring the simulator until the instant a new action is applied and only then calculate the reward, but this means the reward received doesn't apply to the action that was just taken (see below).

So by using the on_postprocess_trajectory callback, I can take all the rewards for an episode and shift them backwards by 1, so that the correct reward is associated with each action.
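
For reference, a simplified sketch of that callback (the exact import path depends on your Ray version, and what you put in the final step's reward after the shift is up to you; here it is simply zeroed):

```python
import numpy as np

# In older Ray versions this lives in ray.rllib.agents.callbacks instead.
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class ShiftRewardsCallback(DefaultCallbacks):
    def on_postprocess_trajectory(
        self,
        *,
        worker,
        episode,
        agent_id,
        policy_id,
        policies,
        postprocessed_batch,
        original_batches,
        **kwargs,
    ):
        rewards = postprocessed_batch[SampleBatch.REWARDS]
        # The reward measured just before action t+1 was applied actually
        # belongs to action t, so shift everything back by one step.
        shifted = np.zeros_like(rewards)
        shifted[:-1] = rewards[1:]
        postprocessed_batch[SampleBatch.REWARDS] = shifted
```

Depending on the algorithm, other postprocessing (e.g. advantage estimation) may already have run on this batch, so it's worth double-checking that the shift happens where you expect it to.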
