Hello,
This is my first time using Ray, and I like how it helps with fine-tuning very complex models.
I am using Ray to fine-tune my model (BERT), but I keep getting this bottleneck warning:

```
2021-09-29 14:51:42,804 WARNING util.py:164 -- The start_trial operation took 2.259 s, which may be a performance bottleneck.
```

I tried every possible solution I found here or on your website (by the way, FAQ section 22, "How can I avoid bottlenecks?", is very complicated to follow; it would be helpful if you added some examples for the proposed solutions), but with no luck.
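For example, one suggestion I think I understood is to avoid capturing large objects in the training function and to pass them through tune.with_parameters instead. This is roughly how I read it (a sketch using my own train_tune and dataloaders, which are shown further down; the train_dl/val_dl names are just mine), but it did not make the warning go away for me:

```python
# My reading of one FAQ suggestion (my interpretation, may be wrong): pass
# large objects such as the dataloaders through tune.with_parameters so they
# go into the object store once instead of being shipped with every trial.
# train_tune and the dataloaders are the ones from my snippet below; for this
# to work, train_tune would also need to accept the two extra arguments.
trainable = tune.with_parameters(
    train_tune,
    train_dl=dataloader_train,
    val_dl=dataloader_validation)
```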
And it does not stop there. When I pass get_tune_ddp_resources(num_workers=4) to the resources_per_trial argument (I copied this one, just for testing, from your GitHub here):
```python
analysis = tune.run(
    model,
    metric="loss",
    mode="min",
    config=config,
    num_samples=10,
    resources_per_trial=get_tune_ddp_resources(num_workers=4),
    name="tune_mnist")
```
I get these results, and the trials remain "PENDING" for hours with no "RUNNING" or anything!
| Trial name | status | loc | batch_size | lr |
|---|---|---|---|---|
| model_0195b_00000 | PENDING | | 8 | 0.0121142 |
| model_0195b_00001 | PENDING | | 4 | 0.000707256 |
| model_0195b_00002 | PENDING | | 4 | 0.000810177 |
| model_0195b_00003 | PENDING | | 4 | 0.0134884 |
| model_0195b_00004 | PENDING | | 4 | 0.000635871 |
| model_0195b_00005 | PENDING | | 8 | 0.000206381 |
| model_0195b_00006 | PENDING | | 8 | 0.0173723 |
| model_0195b_00007 | PENDING | | 4 | 0.000157989 |
| model_0195b_00008 | PENDING | | 8 | 0.000697227 |
| model_0195b_00009 | PENDING | | 8 | 0.0356216 |
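Maybe related: re-reading the GitHub example I copied get_tune_ddp_resources from, I think it also attaches a Ray plugin to the Lightning Trainer, which my code further down does not do. Is something like this required for get_tune_ddp_resources(num_workers=4) to actually get scheduled? This is only my guess at what the example intends; I am assuming the plugin is ray_lightning's RayPlugin and that its num_workers has to match:

```python
# My guess (assumption, not from my working code): the Trainer is built with
# a Ray plugin whose num_workers matches the resources requested through
# get_tune_ddp_resources(num_workers=4).
from ray_lightning import RayPlugin

trainer = pl.Trainer(
    max_epochs=epochs,
    progress_bar_refresh_rate=0,
    callbacks=[callback],
    logger=logger,
    plugins=[RayPlugin(num_workers=4, use_gpu=True)])
```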
And when I change it to resources_per_trial={'gpu': 1}, one trial ends up in "ERROR" (the last row below) and the rest remain "PENDING" for hours.
| Trial name | status | loc | batch_size | lr |
|---|---|---|---|---|
| model_278c7_00001 | PENDING | | 4 | 0.0186282 |
| model_278c7_00002 | PENDING | | 4 | 0.037845 |
| model_278c7_00003 | PENDING | | 8 | 0.00038114 |
| model_278c7_00004 | PENDING | | 4 | 0.000118839 |
| model_278c7_00005 | PENDING | | 4 | 0.000281904 |
| model_278c7_00006 | PENDING | | 8 | 0.00388556 |
| model_278c7_00007 | PENDING | | 8 | 0.00101594 |
| model_278c7_00008 | PENDING | | 8 | 0.000134712 |
| model_278c7_00009 | PENDING | | 8 | 0.00624219 |
| model_278c7_00000 | ERROR | | 8 | 0.000103422 |
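In case it helps diagnose the PENDING trials, I can also run something like this and post the output; I am guessing these are the right calls to check which resources Ray actually detects on my machine:

```python
import ray

# Print what Ray thinks this machine has vs. what is still free
# (CPUs, GPUs, memory). I can share the output if that is useful.
ray.init(ignore_reinit_error=True)
print(ray.cluster_resources())
print(ray.available_resources())
```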
Below is a snippet of my code. I hope it helps you find the error and where it comes from.
```python
from ray.tune.integration.pytorch_lightning import TuneReportCallback

# I am using PyTorch Lightning, so these are the results reported from my
# model's validation_epoch_end method.
callback = TuneReportCallback({
    "avg_val_loss": "avg_val_loss",
    "avg_accuracy": "avg_accuracy"},
    on="validation_end")
```
```python
config = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([4, 8]),
}

def train_tune(config, epochs=N_EPOCHS, gpus=1):
    model = CommentClassifier(config,
                              n_warmup_steps=warmup_steps,
                              n_training_steps=total_training_steps)
    trainer = pl.Trainer(
        max_epochs=epochs,
        gpus=gpus,
        progress_bar_refresh_rate=0,
        callbacks=[callback],
        logger=logger)
    trainer.fit(model, dataloader_train, dataloader_validation)

analysis = tune.run(
    model,
    metric="loss",
    mode="min",
    config=config,
    num_samples=10,
    resources_per_trial={'gpu': 1},  # get_tune_ddp_resources(num_workers=4)
    name="tune_mnist")

print("Best hyperparameters: ", analysis.best_config)
```