How severely does this issue affect your experience of using Ray?
Low: It annoys or frustrates me for a moment.
I am training a DQN agent using the PolicyServer / PolicyClient setup with the default ExternalEnv, similar to the CartPole server/client example.
From my external simulator (written in Java) I call a wrapper API hosted with Ray Serve, which uses the PolicyClient to interact with the PolicyServer. My training loop is as follows (see the code sketch after the list):
client.start_episode
client.get_action
client.log_returns (12 times)
client.get_action
client.log_returns (12 times)
client.end_episode
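For concreteness, here is a rough sketch of what the Ray Serve wrapper does with the PolicyClient over one episode. The server address and the simulator helper calls are placeholders for the actual interaction with the Java side:

```python
from ray.rllib.env.policy_client import PolicyClient

# Connect to the PolicyServer (address/port are placeholders).
client = PolicyClient("http://localhost:9900", inference_mode="remote")

episode_id = client.start_episode(training_enabled=True)

obs = get_observation_from_simulator()      # placeholder for the Java-side call
action = client.get_action(episode_id, obs)

# The simulator needs several internal steps before the consequences of the
# action are fully visible, so the reward is logged 12 times in between.
for _ in range(12):
    obs, reward = step_simulator(action)    # placeholder for the Java-side call
    client.log_returns(episode_id, reward)

action = client.get_action(episode_id, obs)
for _ in range(12):
    obs, reward = step_simulator(action)    # placeholder for the Java-side call
    client.log_returns(episode_id, reward)

client.end_episode(episode_id, observation=obs)
```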
Between the client.get_action calls I log the returns 12 times because the simulator takes a few timesteps before the full consequences of the action are visible. This should work because ExternalEnv accumulates the rewards until the next client.get_action call, correct?
My main question is how a “timestep” is defined in this scenario. Does one episode in this example consist of 2 timesteps (because get_action is called twice) or of 24 timesteps (because 24 returns are logged)?
I am asking because I am confused by the “learning_starts” and “timesteps_per_iteration” options in the DQN config.
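For reference, these are the two options I mean, shown as a minimal sketch with placeholder values (assuming the dict-style DQN config):

```python
config = {
    # Number of sampled timesteps to collect before learning starts.
    "learning_starts": 1000,
    # Minimum number of timesteps accumulated per call to train().
    "timesteps_per_iteration": 1000,
    # ... rest of the DQN / PolicyServer config
}
```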
This is what I am doing right now: I call get_action and then wait until the simulator is ready for the next action; while it waits, it logs the returns it receives from the environment.
But my question is whether one timestep equals one call to get_action or one call to log_returns.