Hi @BaamPark,
Thanks for making this detailed post!
Q1: Where are the results logged to?
The default logging directory is ~/ray_results/ in the local filesystem. This can be changed through ray.air.RunConfig(storage_path) passed into the tune.Tuner. See Tune Execution (tune.Tuner) — Ray 2.3.1 and ray.air.RunConfig — Ray 2.3.1.
By default, the experiment directory name is uniquely generated with a timestamp, but it can also be set manually (to something meaningful) with RunConfig(name).
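If you just want to find the newest experiment folder on disk, here is a minimal sketch using only the standard library. The helper name latest_experiment_dir is hypothetical (it is not part of Ray), and it assumes each experiment is a direct subdirectory of the results root, as described above.

```python
import os
from typing import Optional


def latest_experiment_dir(results_root: str = "~/ray_results") -> Optional[str]:
    """Return the most recently modified subdirectory of results_root, or None.

    Hypothetical helper, not a Ray API. Assumes every experiment lives in
    its own folder directly under results_root.
    """
    root = os.path.expanduser(results_root)
    if not os.path.isdir(root):
        return None
    # Keep only directories; stray files in the root are ignored.
    candidates = [entry.path for entry in os.scandir(root) if entry.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```

Calling latest_experiment_dir() with no arguments looks in the default location; pass your own path if you changed it through RunConfig.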
Within the actual Tune run, we download the dataset to "./data". This is a relative path, which is resolved relative to the trial directory. This means that each trial (e.g., when you are performing a search using Tune search spaces) will download the data separately.
If you want to use a single download folder across all trials instead, you can set data_dir="/some/shared/directory".
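To make the difference concrete, here is a small sketch of how a relative and an absolute data_dir resolve for two trials. The function resolve_data_dir and the trial paths are made up for illustration; the sketch only mimics ordinary path resolution against a working directory and does not call Ray.

```python
import os


def resolve_data_dir(data_dir: str, trial_dir: str) -> str:
    """Return the location data_dir points to when the working directory is trial_dir.

    Hypothetical helper for illustration. os.path.join discards trial_dir
    when data_dir is already absolute, which is the shared-folder case.
    """
    return os.path.normpath(os.path.join(trial_dir, data_dir))


# Illustrative trial directories (not real Ray output).
trial_a = os.path.join(os.sep, "tmp", "exp", "trial_a")
trial_b = os.path.join(os.sep, "tmp", "exp", "trial_b")
shared = os.path.join(os.sep, "some", "shared", "directory")

# Relative path: one download location per trial.
per_trial = {resolve_data_dir("./data", t) for t in (trial_a, trial_b)}
# Absolute path: the same location for every trial.
shared_locations = {resolve_data_dir(shared, t) for t in (trial_a, trial_b)}
```

Here per_trial contains two distinct paths, while shared_locations contains only one.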
Q2: Where is the checkpoint?
You also mentioned that torch.save is happening, but no ckpt.pt file is found. This is because we’re reporting the checkpoint file to Tune directly in Tune’s checkpointing logic:
# From https://docs.ray.io/en/latest/tune/examples/tune-pytorch-cifar.html
torch.save(
    (net.state_dict(), optimizer.state_dict()), "my_model/checkpoint.pt"
)
checkpoint = Checkpoint.from_directory("my_model")
session.report(
    {"loss": (val_loss / val_steps), "accuracy": correct / total},
    checkpoint=checkpoint,
)
Notice that we pass the checkpoint directory my_model that contains the .pt file to Tune as a ray.air.Checkpoint object. This will read in the checkpoint data and package it in a "checkpoint_0000x" folder within the trial directory.
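To locate those packaged folders on disk, something like the sketch below may help. list_checkpoint_dirs is a hypothetical helper, not a Ray API, and it assumes the folders match the "checkpoint_*" naming described above with zero-padded indices.

```python
import glob
import os
from typing import List


def list_checkpoint_dirs(trial_dir: str) -> List[str]:
    """Return the checkpoint folders inside trial_dir, sorted by name.

    Hypothetical helper. Assumes folders are named "checkpoint_<index>"
    with zero-padded indices, so a string sort matches the numeric order.
    """
    pattern = os.path.join(trial_dir, "checkpoint_*")
    # Skip plain files that happen to match the pattern.
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
```

The last element of the returned list is then the most recent checkpoint folder, and its path can be passed to Checkpoint.from_directory.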
You can access the contents after the run with these two methods:
- ray.air.checkpoint.Checkpoint.from_directory — Ray 2.3.1
- ray.air.checkpoint.Checkpoint.as_directory — Ray 2.3.1
Example:
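import os
import torch
from ray.air import Checkpoint

checkpoint: Checkpoint = Checkpoint.from_directory("/path/to/checkpoint_folder")
with checkpoint.as_directory() as d:
    # The file holds the (model state, optimizer state) tuple saved with torch.save
    model_state, optimizer_state = torch.load(os.path.join(d, "checkpoint.pt"))
    # Use the checkpoint, e.g. net.load_state_dict(model_state)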
See Ray AIR Checkpoint — Ray 2.3.1 for more details.
More resources
The Tune user guides (User Guides — Ray 3.0.0.dev0) may be useful to read through.
These defaults are definitely not obvious, so thanks for raising this question. I’ll create a GitHub issue to add some of these explanations to our user guides/documentation. Let me know if these work for you, and if you have any other questions.