Training getting stuck; Tune is running, but no more episodes occur

kia · May 2, 2021, 2:34pm

I have been using tune.run and it has been working fine, but as of this morning it gets ‘stuck’ after very few episodes, even though my stopping criterion has not changed. For example, even though it still ‘appears’ to be training, the results show:

mannyv · May 3, 2021, 1:34am

Hi @kia,

It looks like your environment is not terminating. If I remember correctly, the episode values (rew, len, etc.) are not logged until an episode ends. Without knowing more about your environment it is not possible to diagnose, but it looks like the agents are in a state where they are not triggering the termination conditions for your environment.

Have you looked in TensorBoard to see how the losses are behaving? Is it possible that the agents could end up in a buggy state in the environment where it cannot end? Is it a case where one agent gets a large negative reward when the environment finishes, and it has learned a behavior that stops the other agent from winning in order to avoid that? Can you “watch” what the agents are doing in an environment?
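One quick way to check the "environment never terminates" theory outside of Tune is to roll out a single episode by hand with a step cap and see whether `done` ever comes back True. Below is a minimal sketch of that idea. `StuckEnv`, `CountdownEnv`, `probe_termination`, and `random_policy` are all hypothetical names made up for illustration, and the envs use the older gym-style `step()` return of `(obs, reward, done, info)`. You would swap in your own environment and, ideally, actions from your trained policy.

```python
import random


class StuckEnv:
    """Hypothetical toy environment whose `done` flag never becomes True."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, 0.0, False, {}


class CountdownEnv:
    """Hypothetical toy environment that ends after `length` steps."""

    def __init__(self, length=10):
        self.length = length

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.length, {}


def probe_termination(env, policy, max_steps=1000):
    """Run one episode and return (terminated, steps_taken)."""
    obs = env.reset()
    for step in range(1, max_steps + 1):
        obs, reward, done, info = env.step(policy(obs))
        if done:
            return True, step
    # The cap was reached without the environment ever reporting done.
    return False, max_steps


def random_policy(obs):
    """Placeholder policy; replace with actions from your trained agent."""
    return random.choice([0, 1])


print(probe_termination(StuckEnv(), random_policy))      # (False, 1000)
print(probe_termination(CountdownEnv(10), random_policy))  # (True, 10)
```

If your real environment comes back as `(False, max_steps)` for a generous cap, that would be consistent with episodes never ending and therefore never being logged.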

There is an RLlib config entry you could use, “horizon”, that will cause RLlib to artificially terminate your episode after a maximum number of steps, if that is desired. You can read more about it here: RLlib Sample Collection and Trajectory Views (Ray v2.0.0.dev0)
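As a rough sketch of what setting “horizon” might look like with tune.run: the environment name, the step cap of 500, the "PPO" algorithm, and the stopping criterion below are all assumptions for illustration, not values from the original post. This also reflects the Ray versions current at the time of this thread, so check the config documentation for the version you have installed.

```python
# Config sketch; every concrete value here is an assumed placeholder.
config = {
    "env": "MyEnv-v0",  # hypothetical: the registered name of your environment
    "horizon": 500,     # assumed cap: episode is ended after this many steps
}

# Assumed stopping criterion, counted in completed episodes.
stop = {"episodes_total": 100}


def train():
    """Launch the run; Ray is imported lazily so the dicts can be inspected without it."""
    from ray import tune

    # "PPO" is an assumed algorithm choice; use whichever trainer you already run.
    return tune.run("PPO", config=config, stop=stop)
```

Calling `train()` starts the run. With a cap in place, episodes should finish even if the environment itself never signals termination, so the episode statistics should start showing up in the results again. It only hides the underlying issue, though, so it is still worth finding out why the environment is not ending.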
