I run the following Tuner:
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "1"
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
os.environ["TUNE_RESULT_DIR"] = dirname_
tuner = tune.Tuner(
tune.with_resources(
tune.with_parameters(train, X_original=X_original, y=y),
resources={"cpu": 10, "gpu": gpus_per_trial}
),
tune_config=tune.TuneConfig(
metric="loss",
mode="min",
scheduler=scheduler,
num_samples=num_samples,
),
# run_config=run_config_,
param_space=config,
)
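After fitting, I read the metrics back roughly like this (a simplified sketch of my post-processing; get_best_result is the standard ResultGrid call):

results = tuner.fit()

# Trial with the lowest reported validation loss.
best_result = results.get_best_result(metric="loss", mode="min")
print(best_result.metrics)     # last reported {"loss": ..., "accuracy": ...}
print(best_result.checkpoint)  # checkpoint attached to that report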
My train function contains the following:
import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint


def train(config, X_original, y):
    # ... define model (net), optimizer, data loaders, etc.

    # Load an existing checkpoint through the `get_checkpoint()` API.
    # The ray.train module is referenced explicitly here so it is not
    # shadowed by this function's own name.
    loaded_checkpoint = ray.train.get_checkpoint()
    if loaded_checkpoint:
        with loaded_checkpoint.as_directory() as loaded_checkpoint_dir:
            model_state, optimizer_state = torch.load(
                os.path.join(loaded_checkpoint_dir, "checkpoint.pt")
            )
            net.load_state_dict(model_state)
            optimizer.load_state_dict(optimizer_state)

    # ... epoch loop: training and validation, producing val_loss, val_steps, correct, total ...

    # Save a checkpoint into a temporary directory and report it together with the metrics.
    with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
        # temp_checkpoint_dir = "F:/rayCheckpoint"
        path = os.path.join(temp_checkpoint_dir, "checkpoint.pt")
        torch.save((net.state_dict(), optimizer.state_dict()), path)
        checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)
        ray.train.report(
            {"loss": (val_loss / val_steps), "accuracy": (correct / total)},
            checkpoint=checkpoint,
        )
But I have limited memory on my laptop, and I want to save the checkpoints on a separate disk (“F:/rayCheckpoint”) instead of the temporary folders that get created under AppData/Temp. When I pass a RunConfig to the Tuner, I am no longer able to get the metrics from the checkpoint. Can anybody help me understand what I am doing wrong?
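For reference, this is roughly the kind of run_config_ I have been trying (a sketch only; the storage_path and the experiment name are what I intend to use, not something I have verified):

from ray.train import CheckpointConfig, RunConfig

run_config_ = RunConfig(
    storage_path="F:/rayCheckpoint",  # root directory for trial results and checkpoints
    name="my_tune_experiment",        # hypothetical experiment name
    checkpoint_config=CheckpointConfig(num_to_keep=2),  # keep only the 2 latest checkpoints
)

which I then un-comment as run_config=run_config_ in the Tuner above.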