I'm running the following code to pass environment variables to the Ray Tune `Tuner`:
```python
import os

import ray
import torch
from ray import train, tune
from ray.train import Checkpoint, CheckpointConfig, RunConfig, SyncConfig


def train_model(config):
    .....
    # Create a checkpoint for this epoch.
    torch.save(
        {"epoch": epoch, "model_state_dict": model.state_dict()},
        os.path.join(
            checkpoint_dir,
            f"{train.get_context().get_trial_name()}_model_epoch{str(epoch).zfill(8)}.pt",
        ),
    )
    checkpoint = Checkpoint.from_directory(checkpoint_dir)
    metrics = {"loss": running_loss / epoch_steps}
    train.report(metrics=metrics, checkpoint=checkpoint)
run_config = RunConfig(
    name="LSTM_AE",
    storage_path=os.path.abspath(TUNE_CHECKPOINT),
    checkpoint_config=CheckpointConfig(num_to_keep=1),
    sync_config=SyncConfig(
        sync_artifacts=True,
        sync_timeout=600000,
    ),
    log_to_file=True,
    verbose=1,
)
tune_resources = {"cpu": 8}
ray.init(
    runtime_env={
        "env_vars": {
            "PYTHONWARNINGS": "ignore::DeprecationWarning",
            "TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S": "600",
        }
    },
    ignore_reinit_error=True,
)
trainable_with_resources = tune.with_resources(train_model, tune_resources)
tuner = tune.Tuner(
    trainable=trainable_with_resources,
    tune_config=tune_config,
    param_space=param_space,
    run_config=run_config,
)
results = tuner.fit()
```
but I still get the following warning:
```
Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
```
The same problem occurs even if I comment out the `sync_config` option.
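My understanding (possibly wrong) is that `env_vars` under `runtime_env` are applied to Ray worker processes; if this warning is emitted by the driver process, the variable might need to be set there before Ray starts. A minimal sketch of what I could try instead, assuming that is the case:

```python
import os

# Assumption: the excessive-sync warning is emitted by the Tune driver
# process, so the variable may need to exist in the driver's environment
# before Ray starts, rather than only in the workers' runtime_env.
os.environ["TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S"] = "600"

import ray

ray.init(ignore_reinit_error=True)
```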
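Following the warning's own suggestion, I could also keep more checkpoints per trial so syncs are triggered less often; a sketch, where the value 10 is an arbitrary choice:

```python
from ray.train import CheckpointConfig

# Keep more checkpoints per trial so a sync is not triggered as soon as a
# trial has checkpointed more than `num_to_keep` times; 10 is arbitrary.
checkpoint_config = CheckpointConfig(num_to_keep=10)
```

But I would still like to understand why the environment variable is not being picked up.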