@Yard1 thank you very much for your quick response!
I have managed to create a Ray cluster with several nodes and to distribute my Ray Tune trials across all the available devices (I only have NVIDIA accelerators), roughly as sketched below.
So what I actually need is the second part of your answer.
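For context, this is a minimal sketch of how I am currently spreading plain Tune trials over the GPUs (not my exact code; `objective` and the dummy metric are just placeholders):

```python
from ray import tune


def objective(config):
    # ... train something on the single GPU assigned to this trial ...
    tune.report(loss=config["lr"])  # dummy metric, just for illustration


analysis = tune.run(
    objective,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=8,
    # Each trial requests one GPU, so Tune packs trials onto all available GPUs.
    resources_per_trial={"cpu": 1, "gpu": 1},
)
```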
The task I want to accomplish boils down to hyperparameter tuning with ray_lightning, as described in GitHub - ray-project/ray_lightning: Pytorch Lightning Distributed Accelerators using Ray:
```python
from ray import tune

from ray_lightning import RayStrategy
from ray_lightning.examples.ray_ddp_example import MNISTClassifier
from ray_lightning.tune import TuneReportCallback, get_tune_resources

import pytorch_lightning as pl


def train_mnist(config):
    # Create your PTL model.
    model = MNISTClassifier(config)

    # Create the Tune Reporting Callback
    metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
    callbacks = [TuneReportCallback(metrics, on="validation_end")]

    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=callbacks,
        strategy=RayStrategy(num_workers=4, use_gpu=False))
    trainer.fit(model)


config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

# Make sure to pass in ``resources_per_trial`` using the ``get_tune_resources`` utility.
analysis = tune.run(
    train_mnist,
    metric="loss",
    mode="min",
    config=config,
    num_samples=2,
    resources_per_trial=get_tune_resources(num_workers=4),
    name="tune_mnist")

print("Best hyperparameters found were: ", analysis.best_config)
```
Since `get_tune_resources` gives me a PlacementGroup without a specified accelerator, I am not sure where to specify it. Do you have a suggestion for how I should initialize `tune.run` accordingly? Or do I need to switch to `tune.Tuner`?
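For illustration, this is roughly what I would have guessed, assuming `use_gpu` is the relevant flag on both `RayStrategy` and `get_tune_resources` (I am not sure this is correct or sufficient on a multi-node cluster):

```python
# Untested guess; imports, config, and MNISTClassifier as in the snippet above.
def train_mnist(config):
    model = MNISTClassifier(config)
    metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=[TuneReportCallback(metrics, on="validation_end")],
        # Ask each of the 4 Ray workers for a GPU (my assumption).
        strategy=RayStrategy(num_workers=4, use_gpu=True))
    trainer.fit(model)


analysis = tune.run(
    train_mnist,
    metric="loss",
    mode="min",
    config=config,
    num_samples=2,
    # Also reserve GPUs in the trial's placement group (my assumption).
    resources_per_trial=get_tune_resources(num_workers=4, use_gpu=True),
    name="tune_mnist")
```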
PS: I am not sure if this is relevant, but from the PR history it seems that ray_lightning does not yet work with Ray 2.x ([experiment] update the ray lightning to 1.7 by JiahaoYao · Pull Request #222 · ray-project/ray_lightning · GitHub).