Algorithm.train does not terminate for custom env

I use an environment generated by NS3-Gym and am experiencing an issue where RLlib does not terminate after the number of steps I set it to.
For testing purposes I set `rollout_fragment_length`, `train_batch_size`, and `horizon` all to 1, and `batch_mode` to `complete_episodes` (although, from my understanding, the batch mode shouldn't make a difference if `horizon` is 1).

I would now expect the agent to execute 1 step (= 1 episode) and then terminate. When using one of the default environments (I used CartPole-v1 for testing) this works fine: the whole script finishes in about 20 seconds. However, when I use my own environment, the env just gets reset after 1 step and `train()` continues to run. I know that RLlib recognizes that a step has passed, since a) it executed the step (duh) and b) the environment normally would not reset after 1 step. I'm honestly at a loss as to why it just won't stop training, so any help would be greatly appreciated.

Attached below is my training script. Since the env is bound to an external NS3 simulation, this won't run on its own, but I figured it still can't hurt to attach it.

```python
import ray
import ray.rllib.algorithms.dqn as dqn
from ray.tune.registry import register_env
import gym
from ns3gym import ns3env

def env_creator(env_config) -> gym.Env:
    # port=0 lets ns3-gym pick a free port; startSim=True launches the NS3 simulation
    env = ns3env.Ns3Env(port=0, startSim=True)
    return env

register_env("my_env", env_creator)

ray.init()

config = dqn.DEFAULT_CONFIG.copy()
config["disable_env_checking"] = True  # reset time is quite high, so env checking is disabled
config["rollout_fragment_length"] = 1
config["batch_mode"] = "complete_episodes"
config["train_batch_size"] = 1
config["horizon"] = 1

algo = dqn.DQN(config=config, env="my_env")
result = algo.train()  # this call never returns with the NS3 env
```

I’m not sure if I understand.
Your goal is to reset the env after one step.
So as per the following sentence, you’ve achieved that already:

> However, when I’m using my own environment, the env just gets reset after 1 step and train() continues to run.

But now `train()` continues to run? What’s the issue with that? That is intended behaviour: training generally continues forever if you don’t introduce stopping conditions.
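To make that concrete, the usual pattern is to call `train()` in a loop and break on a condition you choose. The sketch below is runnable without Ray or NS3 because it uses a made-up stand-in for the algorithm object (`FakeAlgo`, `run_until`, and the reward threshold are all hypothetical names for illustration); with a real RLlib `Algorithm` you would pass that object instead.

```python
# Sketch of the stopping-condition pattern: train() runs one training
# iteration and returns a result dict; the caller decides when to stop.
class FakeAlgo:
    """Hypothetical stand-in for an RLlib Algorithm (not a real RLlib class)."""
    def __init__(self):
        self.iter = 0

    def train(self):
        self.iter += 1
        # Mimic the shape of RLlib's result dict with two of its keys.
        return {"training_iteration": self.iter,
                "episode_reward_mean": 50.0 * self.iter}

def run_until(algo, reward_target=150.0, max_iters=100):
    # Stop once the mean episode reward reaches the target,
    # or after max_iters iterations as a safety limit.
    for _ in range(max_iters):
        result = algo.train()
        if result["episode_reward_mean"] >= reward_target:
            break
    return result

final = run_until(FakeAlgo())
print(final["training_iteration"])  # stops at iteration 3 (reward 150.0)
```

The same loop works unchanged with a real `Algorithm` instance, since RLlib's `train()` also returns a result dict containing `episode_reward_mean` and `training_iteration`.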

Maybe I phrased my question wrong, but doesn’t `train()` usually execute only one episode?
That’s what I was experiencing with the CartPole env at least.
Doesn’t really matter, since I’ve gotten this to work by switching from the config dictionary to `DQNConfig`. Cheers for the reply anyway!
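For reference, the equivalent setup with the `DQNConfig` builder API looks roughly like this. This is a sketch based on the Ray 2.x API; exact method names have shifted between Ray versions (e.g. `rollouts()` was later renamed), and `horizon` has since been deprecated, so check the docs for your installed version.

```python
from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment(env="my_env", disable_env_checking=True)
    .rollouts(
        rollout_fragment_length=1,
        batch_mode="complete_episodes",
    )
    .training(train_batch_size=1)
)

algo = config.build()  # replaces dqn.DQN(config=config, env="my_env")
```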
