How to debug performance bottlenecks

I do not understand why these bottlenecks occur (see the warnings at the end of this post). What are the usual suspects?

I am using Bayesian optimization (BO) with ASHA.
I am passing the training data through the config. Is that wrong?
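Roughly speaking, my setup looks something like the sketch below (illustrative only, not my actual code; the trainable and the toy data are placeholders):

```python
from ray import tune

def train_loss_fn(config):
    # Illustrative only: the training data travels inside the trial config,
    # so it is also serialized into every trial and experiment checkpoint.
    data = config["data"]
    tune.report(loss=sum(data) * config["lr"])

data = list(range(1_000_000))  # stand-in for the real training data

tune.run(
    train_loss_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1), "data": data},
    num_samples=10,
)
```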

I see that the Ray results folder is now 3.5 GB. Is the config stored as part of the experiment state? Is it possible to avoid storing it?

The problem seems extremely similar to this issue on the Ray GitHub repository, but it looks like it no longer applies with the current version of Ray.

Now, if I try to pass the parameter `global_checkpoint_period=np.inf`, I get this exception:

ValueError: global_checkpoint_period is deprecated. Set env var 'TUNE_GLOBAL_CHECKPOINT_S' instead.
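I assume the replacement is to set that environment variable before starting the run, something like the snippet below (the 600-second value is just a guess on my part):

```python
import os

# Assumption on my side: the replacement for the deprecated
# global_checkpoint_period argument is this environment variable
# (value in seconds), set before tune.run() is called.
os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "600"
```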

Thanks!

2021-01-13 13:08:00,904 WARN util.py:142 -- The `callbacks.on_trial_result` operation took 5.620 s, which may be a performance bottleneck.
2021-01-13 13:08:00,918 WARN util.py:142 -- The `process_trial` operation took 5.644 s, which may be a performance bottleneck.
2021-01-13 13:08:03,662 WARN util.py:142 -- The `experiment_checkpoint` operation took 2.743 s, which may be a performance bottleneck.

How many trials are you running? Yes, the config is stored in the experiment state. Do you use it to transfer data to the trainables?

If so, it might be good to look into tune.with_parameters instead: Execution (tune.run, tune.Experiment) — Ray v1.2.0.dev0
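As a rough illustration of that pattern (the trainable and data below are placeholders, not your code), it could look something like this:

```python
from ray import tune

def train_fn(config, data=None):
    # `data` is delivered through the Ray object store by tune.with_parameters,
    # so it is not stored in the trial config or the experiment checkpoints.
    tune.report(loss=sum(data) * config["lr"])

data = list(range(1_000_000))  # placeholder for the actual training data

tune.run(
    tune.with_parameters(train_fn, data=data),
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=10,
)
```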

We’re working on improving the error messages for these bottlenecks. The `callbacks.on_trial_result` warning usually means that logging a single result takes a long time; if there is data in the config, that might be the reason.
The `process_trial` operation includes the callback invocation, so if one warns, the other one warns, too.

Experiment checkpointing also writes the trial configs, so again, if there is a lot of data in them, this might be the reason. Another reason for slow experiment checkpointing can be a large number of trials. We’re working on resolving the latter problem.

It is currently not possible to avoid storing the trial config in the checkpoints.

Hello @kai, and thank you for your answer. At this time I am still trying to figure out how to run everything with the proper configuration, so I am using only 10 iterations.

Yes, I am using the config to transfer data to the loss function.

I will look into the method you proposed; it definitely looks like what I need. Right now I am pickling and de-pickling the arguments to avoid the checkpoint issue, roughly along the lines of the sketch below.
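For context, my current workaround is approximately this (a simplified sketch, not the real code; names and data are placeholders):

```python
import os
import pickle
import tempfile

from ray import tune

# Persist the data once and pass only the file path through the config,
# so the experiment checkpoints stay small.
data = list(range(1_000_000))  # stand-in for the real training data
data_path = os.path.join(tempfile.gettempdir(), "train_data.pkl")
with open(data_path, "wb") as f:
    pickle.dump(data, f)

def train_fn(config):
    # De-pickle the arguments inside the trainable.
    with open(config["data_path"], "rb") as f:
        data = pickle.load(f)
    tune.report(loss=sum(data) * config["lr"])

tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1), "data_path": data_path},
    num_samples=10,
)
```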

It does the trick! Thanks!

I will mention the solution in the related GitHub issue.

Awesome, glad to hear that!

Now that I have simplified the training process I still get the warnings from above, but now they report time deltas of around 0.8 seconds. What can I do to further reduce the performance bottleneck?

The slowdown is very significant when the Bayesian optimization process moves from the initial random search to the actual BO. For some reason there is almost no GPU usage during the BO itself, after the initial random samples have been computed.

Since this seems to be a problem unrelated to the initial topic, I have opened another question.

By the way, we added a section to the FAQ discussing bottlenecks in the documentation here: Tutorials & FAQ — Ray v2.0.0.dev0