I am new to Ray Tune.
I’ve been running into multiple issues while using Ray Tune for hyperparameter tuning in my PyTorch project. Despite following the official documentation and examples, I keep hitting errors around tune.report(): either it is not recognized, or it causes unexpected behavior. For instance, I get errors saying that the metrics expected by AsyncHyperBandScheduler were not included in the trial results, even though I am calling tune.report() to report metrics such as the loss.
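From what I can tell in the 2.x docs, metric reporting may have moved from tune.report() keyword arguments to ray.train.report() (or ray.air.session.report() in slightly earlier releases), which takes a dict. Below is a minimal sketch of that pattern as I understand it; the trainable and the metric name are my own placeholders, so treat this as my reading of the docs rather than a confirmed fix:

import ray
from ray import train, tune

def trainable(config):
    # Toy loop standing in for real training; "loss" is a placeholder metric
    for step in range(10):
        loss = 1.0 / (step + 1)
        # Ray 2.x style: report a dict of metrics once per iteration
        train.report({"loss": loss})

tuner = tune.Tuner(
    trainable,
    tune_config=tune.TuneConfig(metric="loss", mode="min", num_samples=2),
)
results = tuner.fit()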
My environment is Ray 2.10.0 on Linux, with PyTorch for model training. I’ve tried adjusting my code based on suggestions from Ray’s GitHub issues and forums, including setting PYTHONPATH and making sure my custom modules are imported correctly, but the errors persist. Have there been recent changes to tune.report(), or to other parts of the Ray Tune API, that I might have missed? Are there specific configurations or practices I should follow to resolve these errors? Any guidance or recommendations from the community would be greatly appreciated.
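For completeness, this is roughly how I have been making my custom modules visible to the Ray workers; the path is a placeholder for my actual project root, and I’m not certain this is the recommended approach:

import ray

# Ship the project directory to workers and extend PYTHONPATH there.
# "/path/to/my_project" is a placeholder.
ray.init(
    runtime_env={
        "working_dir": "/path/to/my_project",
        "env_vars": {"PYTHONPATH": "/path/to/my_project"},
    }
)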
My Ray import statements:

import ray
import torch
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.hyperopt import HyperOptSearch
My train_model function:
def train_model(config):
    device = "cuda"
    train_dataloader = create_train_dataloader(arguments, config)  # arguments is a module-level dict
    # The VAE class accepts these hyperparameters
    vae = VAE(config, OH_len=10, OH_in_decoder=arguments['OH_in_decoder'])
    optimizer = torch.optim.Adam(
        vae.parameters(),
        lr=config["lr"],
        weight_decay=config["weight_decay"],
        betas=(config["beta1"], config["beta2"]),
    )
    vae.to(device)

    total_loss, total_KL_loss, total_SSIM_loss = 0, 0, 0

    # Training loop
    for epoch in range(config["epochs"]):
        vae.train()
        train_loss = 0
        for x, one_hot in train_dataloader:
            x, one_hot = x.to(device), one_hot.to(device)
            x_hat = vae(x, one_hot)
            # Combined KL / SSIM loss
            loss, KL_loss, SSIM_loss = get_kl_ssim_loss(
                x, x_hat, vae.encoder.sigma, vae.encoder.mu, arguments
            )
            # Backward pass / weight update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            total_KL_loss += KL_loss.item()
            total_SSIM_loss += SSIM_loss.item()
        total_loss += train_loss

    num_samples = len(train_dataloader.dataset)
    avg_loss = total_loss / num_samples
    avg_KL_loss = total_KL_loss / num_samples
    avg_SSIM_loss = total_SSIM_loss / num_samples
    # This is the call that fails / is not picked up by the scheduler
    tune.report(loss=avg_loss, KL_loss=avg_KL_loss, SSIM_loss=avg_SSIM_loss)
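One thing I’ve noticed while debugging: I only report once, after the last epoch, but ASHAScheduler’s grace_period and max_t seem to assume one report per training iteration, so a single report may be part of why the scheduler never sees the metrics it expects. This is the per-epoch variant I also tried, using the dict-style ray.train.report from the 2.x docs; run_one_epoch is a hypothetical stand-in for the batch loop above:

from ray import train

def train_model_per_epoch(config):
    # ... same model/optimizer setup as in train_model above ...
    for epoch in range(config["epochs"]):
        epoch_loss = run_one_epoch()  # hypothetical helper wrapping the batch loop
        # One report per epoch, so ASHA can compare trials at each iteration
        train.report({"loss": epoch_loss})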
My main function:
if __name__ == "__main__":
    ray.init()

    space = {
        ...  # search space elided; see the sketch after this block
    }

    # Initialize the search algorithm with the search space
    search_alg = HyperOptSearch(space, metric="loss", mode="min")

    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=10,
        grace_period=1,
        reduction_factor=2,
    )

    # Call tune.run without 'config' when using 'search_alg'
    result = tune.run(
        train_model,
        resources_per_trial={"cpu": 2, "gpu": 1},
        num_samples=10,
        scheduler=scheduler,
        progress_reporter=CLIReporter(metric_columns=["loss", "training_iteration"]),
        search_alg=search_alg,
    )
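For reference, my search space is shaped roughly like the sketch below. The keys match what train_model reads from config, but the distributions and bounds are illustrative placeholders rather than my real values:

import numpy as np
from hyperopt import hp

# HyperOptSearch accepts a hyperopt-style space when one is passed to it directly
space = {
    "lr": hp.loguniform("lr", np.log(1e-5), np.log(1e-2)),
    "weight_decay": hp.loguniform("weight_decay", np.log(1e-6), np.log(1e-3)),
    "beta1": hp.uniform("beta1", 0.85, 0.95),
    "beta2": hp.uniform("beta2", 0.99, 0.9999),
    "epochs": 10,  # constant; placeholder value
}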