RLlib: steps being sampled and trained but episode count is zero and reward metrics are NaN

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I have a custom environment which I'm training with the SAC config and tuner.fit(). My environment has a max_episode_steps = 200 variable; when the step count reaches this limit, truncated is set to True. I assume this means the episode is counted and reward metrics should be calculated. However, the episode count in progress.csv is always 0 even though the numbers of steps sampled and trained keep increasing. All the reward metrics such as env_runner/episode_reward_mean are NaN as well, and hist_stats is empty.
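For reference, this is the episode/truncation contract I believe my environment follows, shown on a toy Gymnasium env (a minimal sketch, not my actual code; DummyTruncEnv and its spaces are purely illustrative):

import gymnasium as gym
import numpy as np


class DummyTruncEnv(gym.Env):
    # Illustrative only: episodes never terminate, they are truncated at max_episode_steps.

    def __init__(self, max_episode_steps=200):
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(4)
        self.max_episode_steps = max_episode_steps
        self.steps = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.steps = 0
        return np.zeros(4, dtype=np.float32), {}

    def step(self, action):
        self.steps += 1
        obs = np.zeros(4, dtype=np.float32)
        reward = 0.0
        terminated = False                                 # the task itself never signals done here
        truncated = self.steps >= self.max_episode_steps   # time-limit cut-off, as in my env
        return obs, reward, terminated, truncated, {}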

Here is my step function:

def step(self, action):
    self.steps += 1
    print(self.steps)
    action = 2 ** action  # map the discrete action to a power of two

    actions = {self.agentId: action}

    if math.isnan(action):
        print("====================================== action passed is nan =========================================")

    print("STEPS: " + str(self.steps))
    obs, rewards, dones, info_ = self.runner.step(actions)

    for key, value in obs.items():
        obs[key] = np.asarray(value, dtype=np.float32)

    print("observations: ", obs)
    print("dones:", dones)
    print("info:", info_)
    print("rewards:", rewards)

    if dones[self.agentId]:
        self.runner.shutdown()
        self.runner.cleanup()

    if math.isnan(rewards[self.agentId]):
        print("====================================== reward returned is nan =========================================")
    reward = round(rewards[self.agentId], 4)
    print("REWARD: " + str(reward))
    if any(np.isnan(np.asarray(obs[self.agentId], dtype=np.float32))):
        print("====================================== obs returned is nan =========================================")

    obs = obs[self.agentId]
    self.currentRecord = obs
    self.obs.extend(obs)  # stack the new record onto the running observation buffer
    obs = np.asarray(list(self.obs), dtype=np.float32)

    if info_['simDone']:
        dones[self.agentId] = True  # the underlying simulation finished

    if self.steps >= self.max_episode_steps:
        truncated = True  # hit the 200-step time limit
        self.runner.shutdown()
        self.runner.cleanup()
    else:
        truncated = False

    return obs, reward, dones[self.agentId], truncated, {}

My environment can be reset, but resetting takes a while. When I instead add a condition inside the environment that sets done to True, it takes very long and not even a single training iteration completes, even though the environment is resetting.
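To put a number on how slow the reset is, it can be timed from a plain script outside RLlib (a sketch; building OmnetppEnv directly from the same env_config is an assumption on my part):

import time

env = OmnetppEnv(env_config)  # assumption: the env can be constructed directly from env_config
start = time.time()
obs, info = env.reset()
print(f"reset took {time.time() - start:.1f}s")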

Here is my algorithm configuration:

config = (SACConfig()
    .env_runners(num_env_runners=2, rollout_fragment_length=200) 
    .resources(num_gpus=1)
    .environment("OmnetppEnv", env_config=env_config)
    .evaluation(evaluation_config=evaluation_config,))

tuner = tune.Tuner(
    "SAC",
    run_config=air.RunConfig(
        stop={"timesteps_total": 10000},
        name="SAC_1",
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=100,
            checkpoint_at_end=True,
        ),
    ),
    param_space=config,
)

results = tuner.fit()

Things I’ve already tried:

  • used batch_mode = "complete_episodes"
  • set terminated to True after some steps. With this change, not even a single training iteration completes and progress.csv is never created. I think a worker restarts before the environment can reach terminated = True.
  • used an evaluation env runner with evaluation_duration_unit = "timesteps". This kept making the worker crash:
config = (SACConfig()
    .debugging(seed=1)
    .env_runners(num_env_runners=2, rollout_fragment_length=200)  # also tried batch_mode="complete_episodes", horizon=400
    .resources(num_gpus=1)
    .environment("OmnetppEnv", env_config=env_config)
    .evaluation(evaluation_config=evaluation_config,
                evaluation_num_env_runners=1,
                evaluation_interval=1,  # evaluate every training iteration
                evaluation_duration=200,
                evaluation_duration_unit="timesteps",
                evaluation_sample_timeout_s=None,
                evaluation_force_reset_envs_before_iteration=True)
)
  • added a print(episodes) in metrics.py; it is always 0, so I think the problem is that episodes are never being counted, but I'm not sure how to fix it (see the sanity-check loop sketched below).
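To rule out the environment itself, I want to drive it by hand outside RLlib and confirm an episode actually ends after 200 steps (a minimal sketch; constructing OmnetppEnv directly from env_config is again an assumption):

env = OmnetppEnv(env_config)
obs, info = env.reset()
for t in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        print(f"episode ended at step {t + 1}: terminated={terminated}, truncated={truncated}")
        obs, info = env.reset()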