Ray Tune event loop backlogged, slow with checkpointing

One thing I’ve encountered a few times is that when I have many trials, and there are two or one remaining, the trials slow down significantly. Looking at the command prompt output of tune.run, I usually see something like this over and over.

2021-01-26 22:39:20,403 WARNING util.py:152 -- Checkpointing the experiment
state took 22.001 s, which may be a performance bottleneck. Please ensure the
`TUNE_GLOBAL_CHECKPOINT_S` environment variable is something significantly
higher than this duration to ensure compute time is mostly spent on the main
training loop.
== Status ==
Memory usage on this node: 7.9/125.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 1/40 CPUs, 0.25/4 GPUs, 0.0/77.93 GiB heap, 0.0/25.78 GiB objects
Result logdir: /media/sdb1/asivara/ray_results/train_model_2021-01-26_02-13-11
Number of trials: 20/20 (1 RUNNING, 19 TERMINATED)
+-------------------------+------------+---------------------+--------+------------------+---------------+----------+
| Trial name              | status     | loc                 |   iter |   total time (s) |   num_batches |  vl_loss |
|-------------------------+------------+---------------------+--------+------------------+---------------+----------|
| train_model_09d96_00018 | RUNNING    | 129.79.247.50:33249 |    844 |         67897.5  |       1080320 | -9.70047 |
| train_model_09d96_00000 | TERMINATED |                     |    922 |         10478.1  |       1049600 | -9.215   |
| train_model_09d96_00001 | TERMINATED |                     |    916 |         10461    |       1041920 | -9.09443 |
| train_model_09d96_00002 | TERMINATED |                     |   1093 |         28296    |       1268480 | -10.0238 |
| train_model_09d96_00003 | TERMINATED |                     |    532 |          4626.92 |        550400 | -8.72396 |
| train_model_09d96_00004 | TERMINATED |                     |    973 |         12028.5  |       1114880 | -8.9357  |
| train_model_09d96_00005 | TERMINATED |                     |    808 |          7752.65 |        903680 | -8.85031 |
| train_model_09d96_00006 | TERMINATED |                     |    634 |          5617.79 |        680960 | -9.5968  |
| train_model_09d96_00007 | TERMINATED |                     |   1194 |         47338.4  |       1397760 | -9.36708 |
| train_model_09d96_00008 | TERMINATED |                     |    713 |          6436.25 |        782080 | -9.0252  |
| train_model_09d96_00009 | TERMINATED |                     |   1082 |         27791.2  |       1254400 | -9.37948 |
| train_model_09d96_00010 | TERMINATED |                     |    505 |          4401.81 |        515840 | -8.3993  |
| train_model_09d96_00011 | TERMINATED |                     |    866 |          8556.44 |        977920 | -8.7313  |
| train_model_09d96_00012 | TERMINATED |                     |    696 |          6293.08 |        760320 | -9.72582 |
| train_model_09d96_00013 | TERMINATED |                     |    894 |          9057.28 |       1013760 | -9.14824 |
| train_model_09d96_00014 | TERMINATED |                     |    772 |          7177.99 |        857600 | -9.08479 |
| train_model_09d96_00015 | TERMINATED |                     |   1084 |         28946.9  |       1256960 | -8.97886 |
| train_model_09d96_00016 | TERMINATED |                     |    897 |         63878.7  |       1017600 | -8.78988 |
| train_model_09d96_00017 | TERMINATED |                     |    880 |         61552.9  |        995840 | -9.09118 |
| train_model_09d96_00019 | TERMINATED |                     |    549 |         44362.1  |        572160 | -9.16119 |
+-------------------------+------------+---------------------+--------+------------------+---------------+----------+

2021-01-26 22:39:20,414 WARNING ray_trial_executor.py:593 -- Over the last 60
seconds, the Tune event loop has been backlogged procesing new results.
Consider increasing your period of result reporting to improve performance.

2021-01-26 22:39:42,365 WARNING util.py:152 -- Checkpointing the experiment
state took 21.948 s, which may be a performance bottleneck. Please ensure the
`TUNE_GLOBAL_CHECKPOINT_S` environment variable is something significantly
higher than this duration to ensure compute time is mostly spent on the main
training loop.

Do I need to actually set this environment variable? Why is that needed? Is there any way I can have tune manually clear its results cache? Can the checkpoint saving be done asynchronously (or only when the validation loss is at an all-time low) so that way it doesn’t slow down every loop by 20 seconds?

P.S. as an added question, is there any way to get rid of the loc column, it just takes up a lot of horizontal space and I don’t have any need for it myself.

Appreciate all the help!

@kai actually has a PR to address this in the next release (1.3) – right now the proposal moving forward is to making this a bit “smarter” and not checkpointing to frequently.

1 Like

Oh! Is this the case with 2.0.0.dev0?

$ pip3 show ray
Name: ray
Version: 2.0.0.dev0
Summary: Ray provides a simple, universal API for building distributed applications.
Home-page: https://github.com/ray-project/ray
Author: Ray Team
Author-email: ray-dev@googlegroups.com
License: Apache 2.0
Location: /home/linuxbrew/.linuxbrew/lib/python3.7/site-packages
Requires: aiohttp, aiohttp-cors, colorful, msgpack, requests, redis, grpcio, jsonschema, pyyaml, opencensus, gpustat, protobuf, prometheus-client, colorama, aioredis, numpy, py-spy, click, filelock

Oh wait, do I have some version ahead of “the next release”?

The PR isn’t merged yet, so it’s unlikely you have it :slight_smile:

But it will be included in nightly (2.0.0.dev0) soon.

1 Like

Hey @actuallyaswin, it would be great if you could share a bit of your running code. It seems you are just running 20 trials, but still checkpointing takes about 20 seconds, which is a lot. How many results are your trials reporting? Do you store checkpoints to cloud storage? What does your search space configuration look like? Could you share your tune.run() call?

As Richard said, we’re currently working on improving this. However, our use cases are more about a very high number of trials (e.g. 10,000), so a case such as your is very interesting for us to make sure we tackle the right problems.

Hi @kai! Sure thing here is my config dict and the tune.run call I use. I don’t store any checkpoints in the cloud, and with my finetune_model() function, I have a model which trains over an indefinite number of epochs, and tune.report + tune.checkpoint_dir() is called every 10 epochs (that’s how often I validate). Maybe that’s too fast of a rate for some reason? If it is, maybe tune needs a way to automatically ignore the high influx of data, basically subsample / report less often as needed.

You can assume all the variables here are defined. I’m not too savvy with modifying the status table that tune.run repeatedly prints; because I want to see the snr variable showing up on the table, even though it’s only a single value, I have it wrapped under a tune.grid_search to sort of ‘hack’ the variable showing up in the status table. Here, speaker_id is a list of 20 different speakers; so ultimately tune.run runs 20 trials where each trial is for a different speaker.

  num_gpus = 1.0
  config = {
    'finetune_amount': finetune_amount,
    'model_type': model_type,
    'learning_rate': learning_rate,
    'batch_size': C.BATCH_SIZES[model_type],
    'snr': tune.grid_search([snr]),
    'speaker_id': tune.grid_search(speaker_ids),
    'save_to_local': True,
    'test': True,
  }
  # use Tune to queue up trials in parallel on the GPUs
  tune.run(
    finetune_model,
    resources_per_trial={'gpu': num_gpus},
    config=config,
    log_to_file='log.txt',
    keep_checkpoints_num=1,
    queue_trials=True,
  )

Hi and sorry for the delay.

Thanks for sharing your code.

First, for defining the output variables in the table you can use something like this:

from ray.tune import CLIReporter

reporter = CLIReporter(parameter_columns=["model_type", "snr", "batch_size"])
# ...
tune.run(
    # ...,
    progress_reporter=reporter)

Second, it seems very strange that checkpointing takes so long in your case. Previously we’ve only been bottlenecked by processing a large number of trials.

Here’s what happens during experiment checkpointing. The trial runner serializes itself and every trial that has been run and writes them to disk. Serialized trials are cached so we don’t do the same work over and over again. Bottlenecks occurred previously when dealing with a large number of trials, because that took a long time. But you only have 20 trials, which is not much at all.

So the next thing could be that serializing a single trial takes up much time. This can be the case if there are large objects in the config or in the result.

Do you pass anything other than primitives (strings, integers, floats) as variables in the config? E.g. is model_type a string?

And in your finetune_model function, what do you report to tune.run()? Does this contain anything other than primitives?