How to set directory where checkpoints are saved

SantaTitular · December 12, 2023, 6:15pm

I run the following Tuner:

os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "1"
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
os.environ["TUNE_RESULT_DIR"] = dirname_

tuner = tune.Tuner(
            tune.with_resources(
                tune.with_parameters(train, X_original=X_original, y=y),
                resources={"cpu": 10, "gpu": gpus_per_trial}
            ),
            tune_config=tune.TuneConfig(
                metric="loss",
                mode="min",
                scheduler=scheduler,
                num_samples=num_samples,
            ),
            # run_config=run_config_,
            param_space=config,
        )

Where my train function has the following code within:

train():
... Define model & optimizer, etc.
# Load existing checkpoint through `get_checkpoint()` API.
    if train.get_checkpoint():
        loaded_checkpoint = train.get_checkpoint()
        with loaded_checkpoint.as_directory() as loaded_checkpoint_dir:
            model_state, optimizer_state = torch.load(
                os.path.join(loaded_checkpoint_dir, "checkpoint.pt")
            )
            net.load_state_dict(model_state)
            optimizer.load_state_dict(optimizer_state)

Epoch...

with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
        # temp_checkpoint_dir = "F:/rayCheckpoint"
            path = os.path.join(temp_checkpoint_dir, "checkpoint.pt")
            torch.save(
                (net.state_dict(), optimizer.state_dict()), path
            )
            checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)
            train.report(
                {"loss": (val_loss / val_steps), "accuracy": (correct / total)},
                checkpoint=checkpoint,
            )

But I have limited memory on my laptop and I want to save the checkpoints in a separate disk (“F:/rayCheckpoint”) instead of the custom file generated in the AppData/temp. If I use a runconfig in the tuner I’m not able to get the metrics from the checkpoint. Can anybody help me understand?

matthewdeng · December 13, 2023, 7:30pm

Could you explain more by this?

Setting RunConfig(path) will determine where the final checkpoints can be found after training is finished. See How to Configure Persistent Storage in Ray Tune — Ray 2.8.1 for details.

SantaTitular · December 14, 2023, 11:19am

Hi @matthewdeng , thanks for the reply!

I managed to understand the misunderstanding. Basically, I was setting the path of the Tuner on Runconfig (or “TUNE_RESULT_DIR” variable) but, afterwards, when selecting the best model and calling, e.g.:

results = tuner.fit()
best_result = results.get_best_result("loss", "min")
best_result.checkpoint.to_directory(),

I was forgetting to include the path argument best_result.checkpoint.to_directory(path=dirname_) and, thus, I was creating several folders on Temp directories. With this I thought that Tuner was not selecting the best result based on the metrics but I when to the code of the .to_directory method I understood that it created the folder based on the Result class.

Let me also ask the train.report method and return do the same w.r.t reporting the metrics to the tuner. Do they overwrite each other?

Topic		Replies	Views
Trial checkpointing	0	292	June 16, 2023
Tune results saved in ~/ray_results in addition to local storage_dir if TUNE_RESULT_DIR not set Ray Tune	5	1043	March 14, 2024
Issue in saving checkpoints	1	549	November 16, 2022
Nested checkpoint directories	1	223	June 13, 2023
[train] When resuming training, a new `Trial` directory is created, even when resuming from checkpoint	4	434	September 29, 2022

How to set directory where checkpoints are saved

Related topics