Definition of "one timestep" when using ExternalEnv

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I am training a DQN agent using the PolicyServer / PolicyClient with the default ExternalEnv, similar to the CartPole example.

From my external simulator (written in Java) I call a wrapper API hosted with Ray Serve which uses the PolicyClient to interact with the PolicyServer. My training loop is as follows:

  1. client.start_episode
  2. client.get_action
  3. client.log_returns (12 times)
  4. client.get_action
  5. client.log_returns (12 times)
  6. client.end_episode

Between each client.get_action call I log the returns 12 times, because the simulator takes a few timesteps until the full consequences of the action are visible. My assumption is that this works because ExternalEnv accumulates the rewards until the next client.get_action call. Is that correct?
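To make that assumption concrete, here is a runnable sketch of the accumulation behavior I expect. The ToyPolicyClient class below is a stand-in I wrote purely for illustration (it is not RLlib's PolicyClient); only the method names mirror the real client API:

```python
import uuid


class ToyPolicyClient:
    """Stand-in for PolicyClient, mimicking how ExternalEnv is assumed
    to accumulate rewards between successive get_action calls."""

    def __init__(self):
        self.pending_reward = 0.0  # rewards logged since last get_action
        self.step_rewards = []     # one accumulated entry per timestep
        self.timesteps = 0

    def start_episode(self):
        return uuid.uuid4().hex

    def get_action(self, episode_id, obs):
        # A new timestep begins: flush rewards accumulated so far.
        if self.timesteps > 0:
            self.step_rewards.append(self.pending_reward)
            self.pending_reward = 0.0
        self.timesteps += 1
        return 0  # dummy action

    def log_returns(self, episode_id, reward):
        # Does NOT start a new timestep; the reward is accumulated.
        self.pending_reward += reward

    def end_episode(self, episode_id, obs):
        self.step_rewards.append(self.pending_reward)
        self.pending_reward = 0.0


client = ToyPolicyClient()
eid = client.start_episode()
obs = [0.0]
for _ in range(2):          # two get_action calls per episode
    client.get_action(eid, obs)
    for _ in range(12):     # twelve partial rewards logged each time
        client.log_returns(eid, 0.5)
client.end_episode(eid, obs)

print(client.timesteps)     # 2 -> two get_action calls
print(client.step_rewards)  # [6.0, 6.0] -> 12 * 0.5 accumulated per step
```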

My main question is how a "timestep" is defined in this scenario. Does one episode in this example consist of 2 timesteps (because get_action is called twice) or of 24 timesteps (because 24 returns are logged)?

I am asking because I am confused by the “learning_starts” and “timesteps_per_iteration” options in the DQN config.
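For reference, these are the two options in question; the values below are placeholders, not my actual settings:

```python
# Illustrative DQN config fragment (values are placeholders).
config = {
    # Number of timesteps to collect before learning starts --
    # but timesteps counted in which unit?
    "learning_starts": 1000,
    # Minimum number of timesteps sampled per training iteration.
    "timesteps_per_iteration": 1000,
}
```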

Why not just delay your call to get_action until the sim gets to a state where it’s ready for the next action?

That is what I am doing right now: I call get_action and wait until the simulator is ready for the next action, and while it waits it logs the returns it gets from the environment.

But my question is whether one timestep equals one call to get_action or one call to log_returns.

One timestep equals one call to get_action (or log_action). Logging returns does not advance the timestep.

There is one extra detail in multi-agent configs, where you can count steps either by the number of environment steps or by the number of agent steps. I do not think that was your question, though. If you are interested, you can find more detail at the link below; search for multiagent.count_steps_by.
https://docs.ray.io/en/latest/rllib/rllib-sample-collection.html#the-samplecollector-class-is-used-to-store-and-retrieve-temporary-data
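For completeness, that multi-agent option looks like this in the config; the two values shown in the comments are the documented choices:

```python
# Multi-agent step counting (only relevant for multi-agent setups).
config = {
    "multiagent": {
        # "env_steps": one env step counts as one timestep, no matter
        #              how many agents acted in it.
        # "agent_steps": each individual agent action counts as a step.
        "count_steps_by": "env_steps",
    },
}
```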

Thank you, this answers my question.