How to check if training is done

I’m trying to make a separate program that stops only when my RLlib training is over. (I’m using tune.run for training, which is why I’m asking this question under the Ray Tune category.)
Is there a way for me to keep checking whether the training is over?

First, I believe you have to define what you consider “training is over”. I’m not aware of any formal definition for it in RL. That said, I think you may consider stopping training when the agent reaches some desirable (and achievable) performance, when its policy has very little entropy (randomness), when training time surpasses the time you have available, or after a pre-defined number of training iterations.

In terms of API, you can use Tune’s stopping conditions (see the Basic Python API docs), e.g.:

analysis = ray.tune.run(
    # ...
    stop=stop_criteria,
    # ...
)
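
For reference, stop_criteria here would just be a dictionary mapping result metrics to thresholds; a trial stops as soon as any one of them is met. A minimal sketch, assuming standard RLlib/Tune metrics (adjust the names and values to your setup):

stop_criteria = {
    "training_iteration": 100,      # stop a trial after 100 training iterations
    "episode_reward_mean": 200.0,   # or once it reaches this mean episode reward
    "time_total_s": 3600,           # or after one hour of training time
}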

I hope this helps.

Yeah, I should have made myself clearer.

I’m running multiple trials of PPO with Tune’s Population Based Training (PBT). This is all happening in a Ray Cluster.
What I want to do is have another program run outside the Ray Cluster. This program starts running whenever I command it to, but has to end automatically the moment the PBT experiment is over, i.e., when tune.run returns its ExperimentAnalysis object, which basically means the Tune experiment is over. So it’s not really about when the “training is over,” but when the “Tune experiment is over.”

In terms of the stopping conditions via the stop parameter, I usually set `time_total_s` to 72,000 seconds. The training of each individual trial in the PBT population does end after 72,000 seconds, but the actual total run time of the PBT experiment is approximately 84,000 seconds, and this total changes every run. So I doubt I can use the stopping condition to check whether the experiment is over.
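
For concreteness, my call looks roughly like the sketch below (simplified; the real PPO config, mutation space, and values are omitted or replaced with placeholders):

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt_scheduler = PopulationBasedTraining(
    time_attr="time_total_s",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=3600,                         # placeholder value
    hyperparam_mutations={"lr": [1e-3, 1e-4, 1e-5]},    # placeholder mutation space
)

analysis = tune.run(
    "PPO",                              # RLlib PPO trainable
    scheduler=pbt_scheduler,
    num_samples=4,                      # PBT population size (placeholder)
    config={
        "env": "CartPole-v0",           # stand-in for my real PPO config
        "lr": 1e-4,                     # initial value of the mutated hyperparameter
    },
    stop={"time_total_s": 72_000},      # each trial stops after 72,000 seconds
)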

If there’s a way to check, from the outside, whether a certain experiment is running inside a Ray Cluster, that would be wonderful. Then I could just keep calling a function from my non-Ray code to check whether PBT is still running.

Sorry if this all sounds confusing. Please let me know and I’ll try my best to clarify anything! Thanks for the help.

I think I get your problem. I’m not sure there is anything built into Ray designed for such a need. Perhaps you can monitor the job state from the Ray dashboard (I’m not sure if there is any Python API to query it)? I imagine you could also try (from the training thread) to write a file somewhere or call some REST service once training finishes. Besides normal termination, you probably want to handle failures or queued tasks too, I guess.
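
As a very rough sketch of the file-based idea (the path and polling interval below are just placeholders): in the training script, touch a marker file right after tune.run returns, e.g.

# training script (runs inside the Ray cluster)
from pathlib import Path
from ray import tune

DONE_MARKER = Path("/shared/pbt_experiment_done")   # any location the external program can also read

analysis = tune.run(
    # ... your existing PBT/PPO setup ...
)
DONE_MARKER.touch()   # only reached once the whole Tune experiment is over

Your external program then only needs to poll for that file:

# external watcher (no Ray required)
import time
from pathlib import Path

DONE_MARKER = Path("/shared/pbt_experiment_done")

while not DONE_MARKER.exists():
    time.sleep(30)                    # poll every 30 seconds
print("PBT experiment is over")       # shut down / clean up here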

Have you considered writing your own training reporter (see Console Output (Reporters) — Ray v2.0.0.dev0)?
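
A rough sketch of that idea, assuming the ProgressReporter interface of that Ray version, where report() receives a done flag that is True on the final report (worth verifying against your installed version):

import ray
from pathlib import Path
from ray.tune import CLIReporter

class DoneSignalingReporter(CLIReporter):
    # Behaves like the normal console reporter, but also signals experiment completion.
    def report(self, trials, done, *sys_info):
        super().report(trials, done, *sys_info)
        if done:   # True once the entire Tune experiment has finished
            Path("/shared/pbt_experiment_done").touch()   # or call your REST service here

analysis = ray.tune.run(
    # ...
    progress_reporter=DoneSignalingReporter(),
    # ...
)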

I hope any of this helps.