How to check if training is done

I’m trying to make a separate program that stops only when my RLlib training is over. (I’m using tune.run for training, which is why I’m asking this question under the Ray Tune category.)
Is there a way for me to keep checking whether the training is over?

First, I believe you have to define what you consider “training is over”. I’m not aware of any formal definition for it in RL. That said, I think you may consider stopping training when the agent reaches some desirable (and achievable) performance, when its policy has very little entropy (randomness), when training time surpasses the time you have available, or after a pre-defined number of training iterations.

In terms of API, you can use Tune’s stopping conditions (see the Basic Python API docs), e.g.:

analysis = ray.tune.run(
    # ...
    stop=stop_criteria,
    # ...
)
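
For reference, stop_criteria here would just be a dictionary mapping result metrics to thresholds; a trial stops as soon as any one of them is met. A minimal sketch, assuming standard RLlib/Tune metrics (adjust the names and values to your setup):

stop_criteria = {
    "training_iteration": 100,      # stop a trial after 100 training iterations
    "episode_reward_mean": 200.0,   # or once it reaches this mean episode reward
    "time_total_s": 3600,           # or after one hour of training time
}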

I hope this helps.

Yeah, I should have made myself clearer.

I’m running multiple trials of PPO with Tune’s Population Based Training (PBT). This is all happening in a Ray Cluster.
What I want to do is have another program run outside the Ray Cluster. This program starts running whenever I command it to, but has to end automatically the moment the PBT experiment is over, i.e., when tune.run returns its ExperimentAnalysis object, which basically means the Tune experiment is over. So it’s not really about when the “training is over,” but when the “Tune experiment is over.”

In terms of the stopping conditions via the stop parameter, I usually set `time_total_s` to 72,000 seconds. The training of each individual trial in the PBT population does end after 72,000 seconds, but the actual total run time of the PBT experiment is approximately 84,000 seconds, and this total changes every run. So I doubt I can use the stopping condition to check whether the experiment is over.
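
For concreteness, my call looks roughly like the sketch below (simplified; the real PPO config, mutation space, and values are omitted or replaced with placeholders):

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt_scheduler = PopulationBasedTraining(
    time_attr="time_total_s",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=3600,                         # placeholder value
    hyperparam_mutations={"lr": [1e-3, 1e-4, 1e-5]},    # placeholder mutation space
)

analysis = tune.run(
    "PPO",                              # RLlib PPO trainable
    scheduler=pbt_scheduler,
    num_samples=4,                      # PBT population size (placeholder)
    config={
        "env": "CartPole-v0",           # stand-in for my real PPO config
        "lr": 1e-4,                     # initial value of the mutated hyperparameter
    },
    stop={"time_total_s": 72_000},      # each trial stops after 72,000 seconds
)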

If there’s a way to check, from the outside, whether a certain experiment is running inside a Ray Cluster, that would be wonderful. Then I could just keep calling a function from my non-Ray code to check whether PBT is still running.

Sorry if this all sounds confusing. Please let me know and I’ll try my best to clarify anything! Thanks for the help.

I think I get your problem. I’m not sure there is anything built into Ray designed for such a need. Perhaps you can monitor the job state from the Ray dashboard (I’m not sure if there is any Python API to query it)? I imagine you could also try (from the training thread) to write a file somewhere or call some REST service once training finishes. Besides normal termination, you probably want to handle failures or queued tasks too, I guess.
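
As a very rough sketch of the file-based idea (the path and polling interval below are just placeholders): in the training script, touch a marker file right after tune.run returns, e.g.

# training script (runs inside the Ray cluster)
from pathlib import Path
from ray import tune

DONE_MARKER = Path("/shared/pbt_experiment_done")   # any location the external program can also read

analysis = tune.run(
    # ... your existing PBT/PPO setup ...
)
DONE_MARKER.touch()   # only reached once the whole Tune experiment is over

Your external program then only needs to poll for that file:

# external watcher (no Ray required)
import time
from pathlib import Path

DONE_MARKER = Path("/shared/pbt_experiment_done")

while not DONE_MARKER.exists():
    time.sleep(30)                    # poll every 30 seconds
print("PBT experiment is over")       # shut down / clean up here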

Have you considered writing your own training reporter (see Console Output (Reporters) — Ray v2.0.0.dev0)?
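
A rough sketch of that idea, assuming the ProgressReporter interface of that Ray version, where report() receives a done flag that is True on the final report (worth verifying against your installed version):

import ray
from pathlib import Path
from ray.tune import CLIReporter

class DoneSignalingReporter(CLIReporter):
    # Behaves like the normal console reporter, but also signals experiment completion.
    def report(self, trials, done, *sys_info):
        super().report(trials, done, *sys_info)
        if done:   # True once the entire Tune experiment has finished
            Path("/shared/pbt_experiment_done").touch()   # or call your REST service here

analysis = ray.tune.run(
    # ...
    progress_reporter=DoneSignalingReporter(),
    # ...
)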

I hope any of this helps.