1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.10
- Python version: 3.11.11
- OS: Ubuntu 24.04.2 LTS
- Cloud/Infrastructure: /
- Other libs/tools (if relevant): /
3. What happened vs. what you expected:
I’ve been training some models with Population Based Training, and all of the trials have reached the TERMINATED state because they hit the stop condition.
Is there a way to update the stop condition from the saved state and continue training? E.g., my trials all reached 10M steps, but I would like to extend training to 20M without having to retrain everything from scratch.
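For concreteness, a minimal sketch of a setup like the one described above (the algorithm, environment, hyperparameters, and experiment name are illustrative, not the actual code):

```python
from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining

# PBT scheduler perturbing a single hyperparameter (illustrative values).
pbt = PopulationBasedTraining(
    time_attr="timesteps_total",
    perturbation_interval=100_000,
    hyperparam_mutations={"lr": tune.loguniform(1e-5, 1e-3)},
)

tuner = tune.Tuner(
    "PPO",  # placeholder RLlib algorithm
    param_space={"env": "CartPole-v1", "lr": 1e-4},
    tune_config=tune.TuneConfig(
        scheduler=pbt,
        num_samples=4,
        metric="episode_reward_mean",
        mode="max",
    ),
    run_config=train.RunConfig(
        name="pbt_experiment",
        # All trials end up TERMINATED once they hit this stop condition;
        # the question is how to raise it to 20M and keep training.
        stop={"timesteps_total": 10_000_000},
    ),
)
results = tuner.fit()
```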
You will need to experiment a bit with proper checkpointing, cf. the following page in the API docs:
(How to Save and Load Trial Checkpoints — Ray 2.10.0)
I am not quite sure how this will behave in the case of a successfully terminated trial. Note that checkpointing is intended to save intermediate results from which you can restore in case of technical disruptions (e.g. a server crash) or sporadically failing iterations (e.g. due to exceptions in episodes that are hard to detect).
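For reference, a minimal sketch of what the linked page describes, i.e. saving a checkpoint with every reported result and restoring from it (the trainable, file name, and metric here are illustrative):

```python
import json
import os
import tempfile

from ray import train, tune


def train_fn(config):
    start = 1
    # On restore, continue from the last saved checkpoint if one exists.
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.json")) as f:
                start = json.load(f)["step"] + 1

    for step in range(start, config["max_steps"] + 1):
        # Save a checkpoint alongside every reported result so the trial
        # can always be restored from its latest state.
        with tempfile.TemporaryDirectory() as tmp_dir:
            with open(os.path.join(tmp_dir, "state.json"), "w") as f:
                json.dump({"step": step}, f)
            train.report(
                {"score": step},
                checkpoint=train.Checkpoint.from_directory(tmp_dir),
            )


tuner = tune.Tuner(train_fn, param_space={"max_steps": 10})
tuner.fit()
```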
tune.Tuner.restore() explicitly excludes all terminated trials, cf. the API docs:
Finished trials are always added to the overview table. They will not be resumed.
(ray.tune.Tuner.restore — Ray 2.10.0)
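For reference, a minimal sketch of Tuner.restore() and its resume flags in Ray 2.10; none of these flags will re-run trials that finished cleanly. The experiment path and trainable are placeholders:

```python
import os

from ray import tune

tuner = tune.Tuner.restore(
    path=os.path.expanduser("~/ray_results/pbt_experiment"),
    trainable="PPO",
    resume_unfinished=True,   # pick up trials that were still running/pending
    resume_errored=False,     # continue errored trials from their last checkpoint
    restart_errored=False,    # or restart errored trials from scratch
)
results = tuner.fit()
```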
Yes, Tune won’t restart finished trials. I’ve found a workaround: raising an exception via an RLlib callback, and telling Ray to create a checkpoint via a Tune callback before the exception is raised.
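Roughly, the exception side looks like the sketch below (a simplified sketch, not the exact code; here a plain checkpoint_frequency in CheckpointConfig stands in for the Tune callback that forces the checkpoint, and the algorithm and budget are placeholders):

```python
from ray import train, tune
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class StopByException(DefaultCallbacks):
    # Raise this budget (e.g. to 20M) before restoring to continue training.
    BUDGET = 10_000_000

    def on_train_result(self, *, algorithm, result, **kwargs):
        # Ending the trial with an exception leaves it ERRORED instead of
        # TERMINATED, so it remains restorable from its last checkpoint.
        if result["timesteps_total"] >= self.BUDGET:
            raise RuntimeError("Step budget reached; ending trial as errored.")


tuner = tune.Tuner(
    "PPO",  # placeholder algorithm
    param_space={"env": "CartPole-v1", "callbacks": StopByException},
    run_config=train.RunConfig(
        name="pbt_experiment",
        # Keep a recent checkpoint around so the "errored" trial can resume.
        checkpoint_config=train.CheckpointConfig(checkpoint_frequency=10),
    ),
)
tuner.fit()
# Later: raise BUDGET, then Tuner.restore(path, "PPO", resume_errored=True).fit()
```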
OK, so the RLlib callback raises an exception to produce an artificially “errored” (rather than “terminated”) trial, which can then be restored from its checkpoint. Got it.