One thing I’ve encountered a few times is that when I have many trials and only one or two of them are still running, those remaining trials slow down significantly. Looking at the command prompt output of tune.run, I usually see something like this over and over:
2021-01-26 22:39:20,403 WARNING util.py:152 -- Checkpointing the experiment
state took 22.001 s, which may be a performance bottleneck. Please ensure the
`TUNE_GLOBAL_CHECKPOINT_S` environment variable is something significantly
higher than this duration to ensure compute time is mostly spent on the main
training loop.
== Status ==
Memory usage on this node: 7.9/125.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 1/40 CPUs, 0.25/4 GPUs, 0.0/77.93 GiB heap, 0.0/25.78 GiB objects
Result logdir: /media/sdb1/asivara/ray_results/train_model_2021-01-26_02-13-11
Number of trials: 20/20 (1 RUNNING, 19 TERMINATED)
+-------------------------+------------+---------------------+--------+------------------+---------------+----------+
| Trial name | status | loc | iter | total time (s) | num_batches | vl_loss |
|-------------------------+------------+---------------------+--------+------------------+---------------+----------|
| train_model_09d96_00018 | RUNNING | 129.79.247.50:33249 | 844 | 67897.5 | 1080320 | -9.70047 |
| train_model_09d96_00000 | TERMINATED | | 922 | 10478.1 | 1049600 | -9.215 |
| train_model_09d96_00001 | TERMINATED | | 916 | 10461 | 1041920 | -9.09443 |
| train_model_09d96_00002 | TERMINATED | | 1093 | 28296 | 1268480 | -10.0238 |
| train_model_09d96_00003 | TERMINATED | | 532 | 4626.92 | 550400 | -8.72396 |
| train_model_09d96_00004 | TERMINATED | | 973 | 12028.5 | 1114880 | -8.9357 |
| train_model_09d96_00005 | TERMINATED | | 808 | 7752.65 | 903680 | -8.85031 |
| train_model_09d96_00006 | TERMINATED | | 634 | 5617.79 | 680960 | -9.5968 |
| train_model_09d96_00007 | TERMINATED | | 1194 | 47338.4 | 1397760 | -9.36708 |
| train_model_09d96_00008 | TERMINATED | | 713 | 6436.25 | 782080 | -9.0252 |
| train_model_09d96_00009 | TERMINATED | | 1082 | 27791.2 | 1254400 | -9.37948 |
| train_model_09d96_00010 | TERMINATED | | 505 | 4401.81 | 515840 | -8.3993 |
| train_model_09d96_00011 | TERMINATED | | 866 | 8556.44 | 977920 | -8.7313 |
| train_model_09d96_00012 | TERMINATED | | 696 | 6293.08 | 760320 | -9.72582 |
| train_model_09d96_00013 | TERMINATED | | 894 | 9057.28 | 1013760 | -9.14824 |
| train_model_09d96_00014 | TERMINATED | | 772 | 7177.99 | 857600 | -9.08479 |
| train_model_09d96_00015 | TERMINATED | | 1084 | 28946.9 | 1256960 | -8.97886 |
| train_model_09d96_00016 | TERMINATED | | 897 | 63878.7 | 1017600 | -8.78988 |
| train_model_09d96_00017 | TERMINATED | | 880 | 61552.9 | 995840 | -9.09118 |
| train_model_09d96_00019 | TERMINATED | | 549 | 44362.1 | 572160 | -9.16119 |
+-------------------------+------------+---------------------+--------+------------------+---------------+----------+
2021-01-26 22:39:20,414 WARNING ray_trial_executor.py:593 -- Over the last 60
seconds, the Tune event loop has been backlogged processing new results.
Consider increasing your period of result reporting to improve performance.
2021-01-26 22:39:42,365 WARNING util.py:152 -- Checkpointing the experiment
state took 21.948 s, which may be a performance bottleneck. Please ensure the
`TUNE_GLOBAL_CHECKPOINT_S` environment variable is something significantly
higher than this duration to ensure compute time is mostly spent on the main
training loop.
Do I actually need to set this environment variable, and why is it needed? Is there any way to have Tune manually clear its results cache? And can the checkpoint saving be done asynchronously (or only when the validation loss hits an all-time low), so that it doesn’t slow down every loop by ~20 seconds?
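If setting the variable is the right fix, is it just a matter of doing something like the sketch below before calling tune.run? This is only my guess at how it’s meant to be used; the 120-second value is arbitrary on my part (just something well above the ~22 s the checkpointing currently takes), and train_model here is a stand-in for my real trainable.

```python
import os

# My guess at how TUNE_GLOBAL_CHECKPOINT_S is meant to be set -- done before
# tune.run() so Tune picks it up. 120 s is arbitrary, just something well
# above the ~22 s the experiment checkpointing currently takes.
os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "120"

from ray import tune


def train_model(config):
    # stand-in for my real trainable; it reports num_batches / vl_loss
    for step in range(1000):
        tune.report(num_batches=step * 1280, vl_loss=-9.0)


analysis = tune.run(
    train_model,
    num_samples=20,
    resources_per_trial={"cpu": 1, "gpu": 0.25},
)
```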
P.S. As an added question, is there any way to get rid of the loc column? It just takes up a lot of horizontal space, and I have no need for it myself.
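For what it’s worth, I’m currently using the default reporter. I wondered whether passing a custom CLIReporter via progress_reporter, roughly like the sketch below, would let me control the table columns, but I couldn’t tell whether the built-in loc column can actually be dropped this way (the metric names are just the ones my trainable reports).

```python
from ray import tune
from ray.tune import CLIReporter

# Sketch of what I was considering: CLIReporter lets me choose the metric
# columns, but I don't know whether it has a knob for hiding the built-in
# `loc` column. num_batches / vl_loss are the metrics my trainable reports.
reporter = CLIReporter(
    metric_columns=["training_iteration", "num_batches", "vl_loss"],
)

analysis = tune.run(
    train_model,                # same trainable as in the sketch above
    num_samples=20,
    progress_reporter=reporter,
)
```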
Appreciate all the help!