1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.10
- Python version: 3.11.11
- OS: Ubuntu 24.04.2 LTS
- Cloud/Infrastructure: /
- Other libs/tools (if relevant): /
3. What happened vs. what you expected:
I’ve been training some models with Population Based Training, and all of the trials have reached the TERMINATED state because they hit the stop condition.
Is there a way to update the stop condition from the saved state and continue training? E.g., my trials all reached 10M steps, but I would like to extend training to 20M without having to retrain everything from scratch.
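For concreteness, a minimal sketch of a setup like the one described above (the algorithm, environment, hyperparameters, and experiment name are illustrative, not the actual code):

```python
from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining

# PBT scheduler perturbing a single hyperparameter (illustrative values).
pbt = PopulationBasedTraining(
    time_attr="timesteps_total",
    perturbation_interval=100_000,
    hyperparam_mutations={"lr": tune.loguniform(1e-5, 1e-3)},
)

tuner = tune.Tuner(
    "PPO",  # placeholder RLlib algorithm
    param_space={"env": "CartPole-v1", "lr": 1e-4},
    tune_config=tune.TuneConfig(
        scheduler=pbt,
        num_samples=4,
        metric="episode_reward_mean",
        mode="max",
    ),
    run_config=train.RunConfig(
        name="pbt_experiment",
        # All trials end up TERMINATED once they hit this stop condition;
        # the question is how to raise it to 20M and keep training.
        stop={"timesteps_total": 10_000_000},
    ),
)
results = tuner.fit()
```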
You will need to experiment a bit with proper checkpointing, cf. the following page in the API docs:
(How to Save and Load Trial Checkpoints — Ray 2.10.0)
I am not quite sure how this will behave in the case of a successfully terminated trial. Note that checkpointing is intended to save intermediate results from which you can restore in case of technical disruptions (e.g. a server crash) or sporadically failing iterations (e.g. due to exceptions in episodes that are hard to detect).
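For reference, a minimal sketch of what the linked page describes, i.e. saving a checkpoint with every reported result and restoring from it (the trainable, file name, and metric here are illustrative):

```python
import json
import os
import tempfile

from ray import train, tune


def train_fn(config):
    start = 1
    # On restore, continue from the last saved checkpoint if one exists.
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.json")) as f:
                start = json.load(f)["step"] + 1

    for step in range(start, config["max_steps"] + 1):
        # Save a checkpoint alongside every reported result so the trial
        # can always be restored from its latest state.
        with tempfile.TemporaryDirectory() as tmp_dir:
            with open(os.path.join(tmp_dir, "state.json"), "w") as f:
                json.dump({"step": step}, f)
            train.report(
                {"score": step},
                checkpoint=train.Checkpoint.from_directory(tmp_dir),
            )


tuner = tune.Tuner(train_fn, param_space={"max_steps": 10})
tuner.fit()
```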
tune.Tuner.restore() explicitly excludes all terminated trials, cf. the API docs:
Finished trials are always added to the overview table. They will not be resumed.
(ray.tune.Tuner.restore — Ray 2.10.0)
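For reference, a minimal sketch of Tuner.restore() and its resume flags in Ray 2.10; none of these flags will re-run trials that finished cleanly. The experiment path and trainable are placeholders:

```python
import os

from ray import tune

tuner = tune.Tuner.restore(
    path=os.path.expanduser("~/ray_results/pbt_experiment"),
    trainable="PPO",
    resume_unfinished=True,   # pick up trials that were still running/pending
    resume_errored=False,     # continue errored trials from their last checkpoint
    restart_errored=False,    # or restart errored trials from scratch
)
results = tuner.fit()
```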
Yes, Tune won’t restart finished trials. I’ve found a workaround: raising an exception via an RLlib callback, and telling Ray to create a checkpoint via a Tune callback before the exception is raised.
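Roughly, the exception side looks like the sketch below (a simplified sketch, not the exact code; here a plain checkpoint_frequency in CheckpointConfig stands in for the Tune callback that forces the checkpoint, and the algorithm and budget are placeholders):

```python
from ray import train, tune
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class StopByException(DefaultCallbacks):
    # Raise this budget (e.g. to 20M) before restoring to continue training.
    BUDGET = 10_000_000

    def on_train_result(self, *, algorithm, result, **kwargs):
        # Ending the trial with an exception leaves it ERRORED instead of
        # TERMINATED, so it remains restorable from its last checkpoint.
        if result["timesteps_total"] >= self.BUDGET:
            raise RuntimeError("Step budget reached; ending trial as errored.")


tuner = tune.Tuner(
    "PPO",  # placeholder algorithm
    param_space={"env": "CartPole-v1", "callbacks": StopByException},
    run_config=train.RunConfig(
        name="pbt_experiment",
        # Keep a recent checkpoint around so the "errored" trial can resume.
        checkpoint_config=train.CheckpointConfig(checkpoint_frequency=10),
    ),
)
tuner.fit()
# Later: raise BUDGET, then Tuner.restore(path, "PPO", resume_errored=True).fit()
```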
OK, so the RLlib callback raises an exception to produce an artificially “errored” (rather than “terminated”) trial, which can then be restored from its checkpoint. Got it.