Error related to 'performance bottleneck' and 'start_trial'

Hello,
This is my first time using Ray, and I like how it helps with fine-tuning very complex models.
I am using Ray to fine-tune my model (BERT), but I keep getting the bottleneck warning (2021-09-29 14:51:42,804 WARNING util.py:164 -- The start_trial operation took 2.259 s, which may be a performance bottleneck.). I tried every possible solution I found here or on your website (by the way, the FAQ section 22, "How can I avoid bottlenecks?", is very hard to follow; it would help a lot if you added examples to the proposed solutions), but nothing helped!
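For example, this is how I understood the suggestion about keeping large objects out of the trainable (a minimal sketch; the function and argument names are just placeholders, and I am assuming my dataloaders are the "large objects" the FAQ means). Please correct me if this is not what was intended:

from ray import tune

# Sketch of the "keep large objects out of the trainable" idea as I understood it:
# hand the big objects (here the dataloaders) to tune.with_parameters instead of
# letting the training function capture them from the enclosing scope.
def train_tune_with_data(config, train_dl=None, val_dl=None):
    ...  # build the model here and call trainer.fit(model, train_dl, val_dl)

trainable = tune.with_parameters(
    train_tune_with_data,
    train_dl=dataloader_train,       # my (large) training dataloader
    val_dl=dataloader_validation)    # my (large) validation dataloader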

And it does not stop there. When I pass get_tune_ddp_resources(num_workers=4) to the resources_per_trial argument (I copy-pasted this just for testing from your GitHub here):

analysis = tune.run(
        model,
        metric="loss",
        mode="min",
        config=config,
        num_samples=10,
        resources_per_trial=get_tune_ddp_resources(num_workers=4),
        name="tune_mnist")

I get the output below, and the trials stay “PENDING” for hours with nothing ever switching to “RUNNING”! (A small resource check follows the table.)

Trial name          status    loc    batch_size    lr
model_0195b_00000   PENDING          8             0.0121142
model_0195b_00001   PENDING          4             0.000707256
model_0195b_00002   PENDING          4             0.000810177
model_0195b_00003   PENDING          4             0.0134884
model_0195b_00004   PENDING          4             0.000635871
model_0195b_00005   PENDING          8             0.000206381
model_0195b_00006   PENDING          8             0.0173723
model_0195b_00007   PENDING          4             0.000157989
model_0195b_00008   PENDING          8             0.000697227
model_0195b_00009   PENDING          8             0.0356216
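Here is the resource check I mentioned (a sketch; my assumption is that if Ray sees fewer CPUs/GPUs than get_tune_ddp_resources(num_workers=4) requests, the trials can never be scheduled and just sit in PENDING):

import ray

ray.init(ignore_reinit_error=True)
# What the cluster has in total vs. what is currently free.
print(ray.cluster_resources())
print(ray.available_resources())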

And when I change it to GPU (resources_per_trial={'gpu': 1}), one trial ends in “ERROR” (the last row of the status table) and the others remain “PENDING” for hours. (A sketch for reading the failed trial's error file follows the table.)

Trial name          status    loc    batch_size    lr
model_278c7_00001   PENDING          4             0.0186282
model_278c7_00002   PENDING          4             0.037845
model_278c7_00003   PENDING          8             0.00038114
model_278c7_00004   PENDING          4             0.000118839
model_278c7_00005   PENDING          4             0.000281904
model_278c7_00006   PENDING          8             0.00388556
model_278c7_00007   PENDING          8             0.00101594
model_278c7_00008   PENDING          8             0.000134712
model_278c7_00009   PENDING          8             0.00624219
model_278c7_00000   ERROR            8             0.000103422
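And this is the sketch I mentioned for reading the failed trial's error file (assuming results go to the default ~/ray_results/tune_mnist directory and that Ray Tune writes an error.txt per failed trial, which is my understanding):

import glob
import os

# Look for the per-trial error files under the default results directory.
for err_path in glob.glob(os.path.expanduser("~/ray_results/tune_mnist/*/error.txt")):
    print("===", err_path)
    with open(err_path) as f:
        print(f.read())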

Below is a snippet of my code. I hope it helps you find the error and where it comes from.

import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback

callback = TuneReportCallback({
    "avg_val_loss": "avg_val_loss",
    "avg_accuracy": "avg_accuracy"},
    on="validation_end")  # I am using PyTorch Lightning, so these are the metrics my model logs in validation_epoch_end

config = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([4, 8]),
}

def train_tune(config, epochs=N_EPOCHS, gpus=1):
  model = CommentClassifier(config, n_warmup_steps=warmup_steps, n_training_steps=total_training_steps)

  trainer = pl.Trainer(
    max_epochs=epochs,
    gpus=gpus,
    progress_bar_refresh_rate=0,
    callbacks=[callback],
    logger=logger)
  
  trainer.fit(model, dataloader_train, dataloader_validation)

analysis = tune.run(
        model,
        metric="loss",
        mode="min",
        config=config,
        num_samples=10,
        resources_per_trial={'gpu': 1}, #get_tune_ddp_resources(num_workers=4)
        name="tune_mnist")

print("Best hyperparameters: ", analysis.best_config)