[PBT] Population-based Training early kill?

I’m running 6 DQN trials with Tune’s PBT. Among those 6, two stopped training after 600 iterations and another two stopped after 800 iterations. The remaining two are still training.
However, there is no error for the stopped trials. Moreover, all of these “stopped” trials are still logged as “RUNNING”, as you can see below.

Below is the “episode_reward_max” graph. As you can see, the “stopped” trials had very poor rewards when they stopped. I’m wondering whether this is something PBT does automatically: killing trials with low rewards. If so, where can I find the API for this? I can’t find anything about early termination in the Tune documentation.
By the way, the only stopping criterion is "time_total_s": 72_000, and none of the trials is even close to reaching it.
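For context, here is a minimal sketch of how I’m launching the trials. The custom simulator is swapped for a placeholder env, and the perturbation interval and hyperparameter mutations shown here are illustrative, not my exact values:

```python
import random

import ray
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

ray.init()

# PBT perturbs the hyperparameters of low-performing trials by cloning
# from better ones; as far as I can tell, it should not kill trials outright.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=50,  # illustrative value
    hyperparam_mutations={
        "lr": lambda: random.uniform(1e-5, 1e-3),  # illustrative
        "train_batch_size": [32, 64, 128],         # illustrative
    },
)

tune.run(
    "DQN",
    name="pbt_dqn",
    scheduler=pbt,
    num_samples=6,                  # the 6 trials described above
    stop={"time_total_s": 72_000},  # the only stopping criterion
    config={
        "env": "CartPole-v1",  # placeholder for the custom simulator
        "num_workers": 1,
        "lr": 1e-4,
        "train_batch_size": 32,
    },
)
```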

cc @amogkam – any ideas about this?

Actually, I think this may just be a Ray scheduling issue. Can you create a reproducible script so that I can try running this myself?

Posting the script on GitHub (on the Ray project issue page) would be best!

@rliaw
I ran this with a custom simulator for a company, so the simulator code cannot be shared.
I can still post the Ray-related code, though. Can you still help me without the custom simulator?

@Kai_Yun absolutely. Can you also reproduce this on, say, the Atari benchmarks? That would make it much easier for us to reproduce on our end.
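Something along these lines would work; this is just a sketch, and the env name and config values are examples rather than a specific recommended setup:

```python
import random

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# Same kind of PBT setup as above, with a standard Atari env
# substituted for the custom simulator.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=50,
    hyperparam_mutations={"lr": lambda: random.uniform(1e-5, 1e-3)},
)

tune.run(
    "DQN",
    scheduler=pbt,
    num_samples=6,
    stop={"time_total_s": 72_000},
    config={
        "env": "BreakoutNoFrameskip-v4",  # example Atari benchmark
        "num_workers": 1,
    },
)
```

If the same trials stall on a public env, that gives us a self-contained repro to file on GitHub.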