RLlib: steps are being sampled and trained, but episode count is zero and reward metrics are NaN

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I have a custom environment which I’m training using the SAC config and tuner.fit(). My environment has a max_episode_steps = 200 variable. When the step count reaches this value, truncated is set to True. I assume this means the episode is counted as finished and the reward metrics should be calculated. However, the episode count in progress.csv is always 0 even though the steps sampled and trained keep increasing. The reward metrics such as env_runner/episode_reward_mean are all NaN, and hist_stats is empty.
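To make that expectation concrete, here is a minimal sketch of the behaviour I expect. ToyEnv and count_episodes are hypothetical stand-ins (not my real environment and not RLlib code); ToyEnv only mimics the Gymnasium-style five-value step() return without importing gymnasium.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for my environment (not the real OmnetppEnv)."""

    def __init__(self, max_episode_steps=200):
        self.max_episode_steps = max_episode_steps
        self.steps = 0

    def reset(self, *, seed=None, options=None):
        # The step counter is cleared here; otherwise every step after the
        # first truncation would also report truncated=True.
        self.steps = 0
        return 0.0, {}

    def step(self, action):
        self.steps += 1
        obs = float(self.steps)
        reward = 1.0
        terminated = False  # this toy task never ends on its own
        truncated = self.steps >= self.max_episode_steps
        return obs, reward, terminated, truncated, {}


def count_episodes(env, total_steps):
    """Step the env for total_steps and return the return of each finished episode."""
    episode_returns = []
    episode_return = 0.0
    env.reset()
    for _ in range(total_steps):
        _, reward, terminated, truncated, _ = env.step(random.randint(0, 3))
        episode_return += reward
        if terminated or truncated:
            # An episode boundary: record the return and start a new episode.
            episode_returns.append(episode_return)
            episode_return = 0.0
            env.reset()
    return episode_returns


returns = count_episodes(ToyEnv(max_episode_steps=200), total_steps=1000)
print(len(returns), returns[0])
```

With 1000 steps and a limit of 200 steps per episode, this loop records five finished episodes, which is the kind of count I expected to see in progress.csv.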

Here is my step function:

def step(self, action):
        self.steps += 1
        action = 2**action

        actions = {self.agentId: action}

        if math.isnan(action):
            print("====================================== action passed is nan =========================================")
        print("STEPS: " + str(self.steps))
        obs, rewards, dones, info_= self.runner.step(actions)

        for key, value in obs.items():
            obs[key] = np.asarray(value, dtype=np.float32)
        print("observations: ",obs)
        print("dones:", dones)
        print("rewards:", rewards)
        if dones[self.agentId]:
            pass  # (body omitted)

        if math.isnan(rewards[self.agentId]):
            print("====================================== reward returned is nan =========================================")
        reward = round(rewards[self.agentId],4)
        print("REWARD: " + str(reward))
        if any(np.isnan(np.asarray(obs[self.agentId], dtype=np.float32))):
            print("====================================== obs returned is nan =========================================")
        # completion = defaultdict(int)
        obs = obs[self.agentId]
        self.currentRecord = obs
        obs = np.asarray(list(self.obs),dtype=np.float32)

        if info_['simDone']:
             dones[self.agentId] = True

        if self.steps >= self.max_episode_steps:
            truncated = True
        else:
            truncated = False

        return obs, reward, dones[self.agentId], truncated, {}

My environment can be reset, but resetting takes a while. When I add a condition inside the environment that sets done to True, everything becomes very slow and not even a single training iteration completes, even though the environment is resetting.
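To check whether reset latency alone could explain the stalled iterations, reset() can be timed outside of RLlib. This is only a sketch: SlowResetEnv and mean_reset_seconds are hypothetical names, and the 0.01 s delay is a placeholder for however long the real simulator takes to restart.

```python
import time


class SlowResetEnv:
    """Hypothetical stand-in whose reset() is slow, like restarting a simulator."""

    def __init__(self, reset_delay_s=0.01):
        self.reset_delay_s = reset_delay_s
        self.resets = 0

    def reset(self, *, seed=None, options=None):
        time.sleep(self.reset_delay_s)  # pretend to restart the simulation
        self.resets += 1
        return 0.0, {}


def mean_reset_seconds(env, repeats=5):
    """Return the average wall-clock time of env.reset() over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        env.reset()
    return (time.perf_counter() - start) / repeats


env = SlowResetEnv(reset_delay_s=0.01)
avg = mean_reset_seconds(env, repeats=5)
print(f"average reset time: {avg:.4f} s over {env.resets} resets")
```

Multiplying the measured average by the number of episodes per training iteration gives a rough idea of how much of each iteration is spent resetting.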

Here is my algorithm configuration:

config = (SACConfig()
    .env_runners(num_env_runners=2, rollout_fragment_length=200)
    .environment("OmnetppEnv", env_config=env_config)
)

tuner = tune.Tuner(
    "SAC",
    param_space=config,
    run_config=air.RunConfig(stop={"timesteps_total": 10000}),
)

results = tuner.fit()

Things I’ve already tried:

  • used batch_mode = "complete_episodes"
  • set terminated to True after some steps. With this change not even a single training iteration completes and progress.csv is not even created. I think a worker restarts before the episode reaches terminated = True, so the episode never actually terminates.
  • used an eval_env_runner and used evaluation_duration_unit = ‘timesteps’. This kept making the worker crash.
config = (SACConfig()
    .env_runners(num_env_runners=2, rollout_fragment_length=200)  #, batch_mode="complete_episodes", horizon=400)
    .environment("OmnetppEnv", env_config=env_config)
    .evaluation(
        evaluation_interval=1,  # Evaluate every training iteration
        evaluation_duration_unit="timesteps",
    )
)
  • added print(episodes) in metrics.py. It is always 0, so I think the problem is that episodes are not being counted, but I’m not sure how to fix it.
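For anyone who wants to reproduce the check on progress.csv, this is roughly how the episode-related columns can be inspected. It is a sketch: last_episode_metrics is a hypothetical helper, and the sample data and column names below are made up to show the shape of the file, so the real column names may differ between Ray versions.

```python
import csv
import io


def last_episode_metrics(csv_file):
    """Return {column: last value} for every column whose name mentions 'episode'."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return {}
    last = rows[-1]
    return {name: value for name, value in last.items() if "episode" in name}


# Made-up sample in the shape of progress.csv; real column names may differ.
sample = io.StringIO(
    "training_iteration,num_env_steps_sampled,episodes_this_iter,episode_reward_mean\n"
    "1,400,0,nan\n"
    "2,800,0,nan\n"
)
metrics = last_episode_metrics(sample)
print(metrics)
```

For a real run, the same function can be called on an opened file, for example with open(path, newline="") as f: last_episode_metrics(f), where path points at the trial's progress.csv.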