Hello,
This is my first time using Ray, and I like how it helps with fine-tuning very complex models.
I am using Ray to fine-tune my model (BERT), but I keep getting this bottleneck warning:

```
2021-09-29 14:51:42,804 WARNING util.py:164 -- The start_trial operation took 2.259 s, which may be a performance bottleneck.
```

I tried every possible solution I found here or on your website (by the way, FAQ section 22, "How can I avoid bottlenecks?", is very complicated to follow; it would be helpful if you added some examples for the proposed solutions), but with no luck.
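For example, one suggestion I think I understood is to avoid capturing large objects in the training function and to pass them through tune.with_parameters instead. This is roughly how I read it (a sketch using my own train_tune and dataloaders, which are shown further down; the train_dl/val_dl names are just mine), but it did not make the warning go away for me:

```python
# My reading of one FAQ suggestion (my interpretation, may be wrong): pass
# large objects such as the dataloaders through tune.with_parameters so they
# go into the object store once instead of being shipped with every trial.
# train_tune and the dataloaders are the ones from my snippet below; for this
# to work, train_tune would also need to accept the two extra arguments.
trainable = tune.with_parameters(
    train_tune,
    train_dl=dataloader_train,
    val_dl=dataloader_validation)
```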
And it does not stop there. When I pass get_tune_ddp_resources(num_workers=4) to the resources_per_trial argument (I copied this one, just for testing, from your GitHub here):
```python
analysis = tune.run(
    model,
    metric="loss",
    mode="min",
    config=config,
    num_samples=10,
    resources_per_trial=get_tune_ddp_resources(num_workers=4),
    name="tune_mnist")
```
I get these results, and the trials remain "PENDING" for hours with no "RUNNING" or anything!
| Trial name | status | loc | batch_size | lr |
|---|---|---|---|---|
| model_0195b_00000 | PENDING | | 8 | 0.0121142 |
| model_0195b_00001 | PENDING | | 4 | 0.000707256 |
| model_0195b_00002 | PENDING | | 4 | 0.000810177 |
| model_0195b_00003 | PENDING | | 4 | 0.0134884 |
| model_0195b_00004 | PENDING | | 4 | 0.000635871 |
| model_0195b_00005 | PENDING | | 8 | 0.000206381 |
| model_0195b_00006 | PENDING | | 8 | 0.0173723 |
| model_0195b_00007 | PENDING | | 4 | 0.000157989 |
| model_0195b_00008 | PENDING | | 8 | 0.000697227 |
| model_0195b_00009 | PENDING | | 8 | 0.0356216 |
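Maybe related: re-reading the GitHub example I copied get_tune_ddp_resources from, I think it also attaches a Ray plugin to the Lightning Trainer, which my code further down does not do. Is something like this required for get_tune_ddp_resources(num_workers=4) to actually get scheduled? This is only my guess at what the example intends; I am assuming the plugin is ray_lightning's RayPlugin and that its num_workers has to match:

```python
# My guess (assumption, not from my working code): the Trainer is built with
# a Ray plugin whose num_workers matches the resources requested through
# get_tune_ddp_resources(num_workers=4).
from ray_lightning import RayPlugin

trainer = pl.Trainer(
    max_epochs=epochs,
    progress_bar_refresh_rate=0,
    callbacks=[callback],
    logger=logger,
    plugins=[RayPlugin(num_workers=4, use_gpu=True)])
```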
And when I change it to resources_per_trial={'gpu': 1}, one trial ends up in "ERROR" (the last row below) and the rest remain "PENDING" for hours.
| Trial name | status | loc | batch_size | lr |
|---|---|---|---|---|
| model_278c7_00001 | PENDING | | 4 | 0.0186282 |
| model_278c7_00002 | PENDING | | 4 | 0.037845 |
| model_278c7_00003 | PENDING | | 8 | 0.00038114 |
| model_278c7_00004 | PENDING | | 4 | 0.000118839 |
| model_278c7_00005 | PENDING | | 4 | 0.000281904 |
| model_278c7_00006 | PENDING | | 8 | 0.00388556 |
| model_278c7_00007 | PENDING | | 8 | 0.00101594 |
| model_278c7_00008 | PENDING | | 8 | 0.000134712 |
| model_278c7_00009 | PENDING | | 8 | 0.00624219 |
| model_278c7_00000 | ERROR | | 8 | 0.000103422 |
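In case it helps diagnose the PENDING trials, I can also run something like this and post the output; I am guessing these are the right calls to check which resources Ray actually detects on my machine:

```python
import ray

# Print what Ray thinks this machine has vs. what is still free
# (CPUs, GPUs, memory). I can share the output if that is useful.
ray.init(ignore_reinit_error=True)
print(ray.cluster_resources())
print(ray.available_resources())
```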
Below is a snippet of my code. I hope it helps you find the error and where it comes from.
```python
from ray.tune.integration.pytorch_lightning import TuneReportCallback

# I am using PyTorch Lightning, so these are the results reported from my
# model's validation_epoch_end method.
callback = TuneReportCallback({
    "avg_val_loss": "avg_val_loss",
    "avg_accuracy": "avg_accuracy"},
    on="validation_end")
```
```python
config = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([4, 8]),
}

def train_tune(config, epochs=N_EPOCHS, gpus=1):
    model = CommentClassifier(config,
                              n_warmup_steps=warmup_steps,
                              n_training_steps=total_training_steps)
    trainer = pl.Trainer(
        max_epochs=epochs,
        gpus=gpus,
        progress_bar_refresh_rate=0,
        callbacks=[callback],
        logger=logger)
    trainer.fit(model, dataloader_train, dataloader_validation)

analysis = tune.run(
    model,
    metric="loss",
    mode="min",
    config=config,
    num_samples=10,
    resources_per_trial={'gpu': 1},  # get_tune_ddp_resources(num_workers=4)
    name="tune_mnist")

print("Best hyperparameters: ", analysis.best_config)
```