How severely does this issue affect your experience of using Ray?
Low: It annoys or frustrates me for a moment.
I am training a DQN agent using the PolicyServer / PolicyClient setup with the default ExternalEnv, similar to the CartPole server/client example.
From my external simulator (written in Java) I call a wrapper API hosted with Ray Serve, which uses the PolicyClient to interact with the PolicyServer. My training loop is as follows (see the code sketch after the list):
client.start_episode
client.get_action
client.log_returns (12 times)
client.get_action
client.log_returns (12 times)
client.end_episode
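For concreteness, here is a rough sketch of what the Ray Serve wrapper does with the PolicyClient over one episode. The server address and the simulator helper calls are placeholders for the actual interaction with the Java side:

```python
from ray.rllib.env.policy_client import PolicyClient

# Connect to the PolicyServer (address/port are placeholders).
client = PolicyClient("http://localhost:9900", inference_mode="remote")

episode_id = client.start_episode(training_enabled=True)

obs = get_observation_from_simulator()      # placeholder for the Java-side call
action = client.get_action(episode_id, obs)

# The simulator needs several internal steps before the consequences of the
# action are fully visible, so the reward is logged 12 times in between.
for _ in range(12):
    obs, reward = step_simulator(action)    # placeholder for the Java-side call
    client.log_returns(episode_id, reward)

action = client.get_action(episode_id, obs)
for _ in range(12):
    obs, reward = step_simulator(action)    # placeholder for the Java-side call
    client.log_returns(episode_id, reward)

client.end_episode(episode_id, observation=obs)
```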
Between the client.get_action calls I log the returns 12 times because the simulator takes a few timesteps before the full consequences of the action are visible. This should work because ExternalEnv accumulates the rewards until the next client.get_action call, correct?
My main question is how a “timestep” is defined in this scenario. Does one episode in this example consist of 2 timesteps (because get_action is called twice) or of 24 timesteps (because 24 returns are logged)?
I am asking because I am confused by the “learning_starts” and “timesteps_per_iteration” options in the DQN config.
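For reference, these are the two options I mean, shown as a minimal sketch with placeholder values (assuming the dict-style DQN config):

```python
config = {
    # Number of sampled timesteps to collect before learning starts.
    "learning_starts": 1000,
    # Minimum number of timesteps accumulated per call to train().
    "timesteps_per_iteration": 1000,
    # ... rest of the DQN / PolicyServer config
}
```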
This is what I am doing right now: I call get_action and then wait until the simulator is ready for the next action; while it waits, it logs the returns it receives from the environment.
But my question is whether one timestep equals one call to get_action or one call to log_returns.