Here’s a snapshot of stdout captured at the point where a trial is paused, followed by a snapshot of ray memory output:
Result for ray_train_pipeline_442bfafe:
date: 2021-03-01_22-30-54
done: false
experiment_id: 43ce1c6a79874822a62a93a862f5f0c5
hostname: ip-172-31-12-38.ec2.internal
iterations_since_restore: 2
node_ip: 172.31.12.38
perplexity: 43.609405517578125
pid: 249
should_checkpoint: true
time_since_restore: 2808.534305334091
time_this_iter_s: 1513.9927928447723
time_total_s: 2808.534305334091
timestamp: 1614637854
timesteps_since_restore: 0
train_loss: 4.948446750640869
training_iteration: 2
trial_id: 442bfafe
valid_loss: 3.775272846221924
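As a quick sanity check on the metrics above (my reading of the numbers, not something the logs state explicitly), the reported perplexity appears to simply be exp(valid_loss):

```python
import math

# Values copied from the trial result above.
valid_loss = 3.775272846221924
perplexity = 43.609405517578125

# perplexity matches exp(valid_loss) to within float32 rounding.
print(math.exp(valid_loss))  # ~43.609
assert abs(math.exp(valid_loss) - perplexity) < 1e-3
```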
2021-03-01 22:39:13,695 INFO command_runner.py:356 -- Fetched IP: 172.31.12.38
2021-03-01 22:39:13,695 INFO log_timer.py:25 -- NodeUpdater: i-0a009556544ed34ba: Got IP [LogTimer=92ms]
(pid=235, ip=172.31.7.253) 2021-03-01 22:39:15,228 INFO trainable.py:72 -- Checkpoint size is 1507372567 bytes
(autoscaler +1h16m32s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h16m38s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h16m44s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h16m50s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h16m55s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h17m7s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h17m12s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h17m18s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h17m35s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h17m41s) Resized to 8 CPUs, 1 GPUs.
(autoscaler +1h17m46s) Resized to 72 CPUs, 9 GPUs.
2021-03-01 22:40:36,360 WARNING util.py:161 -- The `callbacks.on_trial_result` operation took 83.052 s, which may be a performance bottleneck.
2021-03-01 22:40:36,361 WARNING util.py:161 -- The `process_trial_result` operation took 83.054 s, which may be a performance bottleneck.
2021-03-01 22:40:36,361 WARNING util.py:161 -- Processing trial results took 83.054 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
2021-03-01 22:40:36,361 WARNING util.py:161 -- The `process_trial` operation took 83.055 s, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 12.0/60.0 GiB
Using HyperBand: num_stopped=0 total_brackets=6
Round #0:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=44.4%): {PAUSED: 3, RUNNING: 2}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=33.3%): {RUNNING: 3}
Round #1:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=27.8%): {PAUSED: 2, RUNNING: 3}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=0.0%): {RUNNING: 2}
Resources requested: 10.0/104 CPUs, 10.0/13 GPUs, 0.0/539.21 GiB heap, 0.0/161.13 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: 442acde6 with valid_loss=3.3867452144622803 and parameters={'gpt2_trainer': {'params': {'learner': {'base_lr': 0.000929089076847593, 'wd': 0.00011212077937010874, 'mom': 0.8074159569205169, 'lr_mult': 66.62697819498219}}}}
Result logdir: /root/ray_results/jimt3-hparam-finder-2021-03-01-21-22
Number of trials: 15/48 (5 PAUSED, 10 RUNNING)
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
| Trial name | status | loc | gpt2_trainer/params/learner/base_lr | gpt2_trainer/params/learner/lr_mult | gpt2_trainer/params/learner/mom | gpt2_trainer/params/learner/wd | iter | total time (s) | train_loss | valid_loss | perplexity |
|-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------|
| ray_train_pipeline_442b5d24 | RUNNING | | 1.52398e-05 | 261.524 | 0.948861 | 0.0230175 | | | | | |
| ray_train_pipeline_442bfafe | RUNNING | 172.31.12.38:249 | 0.000677805 | 130.29 | 0.854526 | 0.000234229 | 2 | 2808.53 | 4.94845 | 3.77527 | 43.6094 |
| ray_train_pipeline_442c984c | RUNNING | 172.31.12.38:248 | 0.000102905 | 148.35 | 0.716498 | 0.00411372 | 2 | 3206.56 | 7.68195 | 5.65236 | 284.964 |
| ray_train_pipeline_442ce464 | RUNNING | 172.31.7.253:234 | 0.00357978 | 105.86 | 0.762005 | 0.0382961 | 2 | 2763.07 | 4.42392 | 3.65554 | 38.6883 |
| ray_train_pipeline_442d325c | RUNNING | 172.31.7.253:195 | 0.00324221 | 304.352 | 0.864147 | 0.00886753 | 2 | 2773.79 | 4.43879 | 3.80846 | 45.0807 |
| ray_train_pipeline_d3dfbe88 | RUNNING | 172.31.78.33:1209 | 0.000376895 | 162.128 | 0.850301 | 0.000227598 | 1 | 1480.65 | 8.6606 | 4.43245 | 84.1369 |
| ray_train_pipeline_4a350630 | RUNNING | | 1.76664e-05 | 276.471 | 0.972833 | 0.0523183 | | | | | |
| ray_train_pipeline_4a18b6c8 | RUNNING | | 0.0680077 | 151.236 | 0.933122 | 0.00472033 | | | | | |
| ray_train_pipeline_808046e0 | RUNNING | | 0.0137273 | 177.625 | 0.814604 | 0.0137248 | | | | | |
| ray_train_pipeline_b6a9dace | RUNNING | | 0.0234356 | 160.888 | 0.829874 | 0.000525914 | | | | | |
| ray_train_pipeline_442acde6 | PAUSED | | 0.000929089 | 66.627 | 0.807416 | 0.000112121 | 2 | 2384.94 | 4.36737 | 3.38675 | 29.5696 |
| ray_train_pipeline_442baa68 | PAUSED | | 0.0533016 | 214.393 | 0.85648 | 0.00971763 | 2 | 3078.3 | 11.9806 | 8.25782 | 3857.68 |
| ray_train_pipeline_442c4c84 | PAUSED | | 0.00442261 | 780.418 | 0.834471 | 0.0702199 | 2 | 2513.01 | 4.56404 | 3.74711 | 42.3985 |
| ray_train_pipeline_442d80cc | PAUSED | | 0.0235995 | 241.531 | 0.98323 | 0.000363078 | 2 | 2917.26 | 9.09916 | 8.01513 | 3026.4 |
| ray_train_pipeline_442dde32 | PAUSED | | 0.000310791 | 308.032 | 0.78468 | 0.0220901 | 2 | 2917.77 | 5.81892 | 5.22594 | 186.035 |
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
2021-03-01 22:40:36,390 WARNING ray_trial_executor.py:658 -- Over the last 60 seconds, the Tune event loop has been backlogged processing new results. Consider increasing your period of result reporting to improve performance.
2021-03-01 22:40:36,663 INFO command_runner.py:356 -- Fetched IP: 172.31.12.38
2021-03-01 22:40:36,663 INFO log_timer.py:25 -- NodeUpdater: i-0a009556544ed34ba: Got IP [LogTimer=85ms]
(autoscaler +1h17m52s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h17m58s) Resized to 8 CPUs, 1 GPUs.
2021-03-01 22:40:45,470 WARNING util.py:161 -- The `process_trial_save` operation took 9.079 s, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 12.2/60.0 GiB
Using HyperBand: num_stopped=0 total_brackets=6
Round #0:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=44.4%): {PAUSED: 4, RUNNING: 1}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=33.3%): {RUNNING: 3}
Round #1:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=27.8%): {PAUSED: 2, RUNNING: 3}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=0.0%): {RUNNING: 3}
Resources requested: 10.0/104 CPUs, 10.0/13 GPUs, 0.0/539.21 GiB heap, 0.0/161.13 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: 442acde6 with valid_loss=3.3867452144622803 and parameters={'gpt2_trainer': {'params': {'learner': {'base_lr': 0.000929089076847593, 'wd': 0.00011212077937010874, 'mom': 0.8074159569205169, 'lr_mult': 66.62697819498219}}}}
Result logdir: /root/ray_results/jimt3-hparam-finder-2021-03-01-21-22
Number of trials: 16/48 (6 PAUSED, 10 RUNNING)
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
| Trial name | status | loc | gpt2_trainer/params/learner/base_lr | gpt2_trainer/params/learner/lr_mult | gpt2_trainer/params/learner/mom | gpt2_trainer/params/learner/wd | iter | total time (s) | train_loss | valid_loss | perplexity |
|-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------|
| ray_train_pipeline_442b5d24 | RUNNING | | 1.52398e-05 | 261.524 | 0.948861 | 0.0230175 | | | | | |
| ray_train_pipeline_442c984c | RUNNING | 172.31.12.38:248 | 0.000102905 | 148.35 | 0.716498 | 0.00411372 | 2 | 3206.56 | 7.68195 | 5.65236 | 284.964 |
| ray_train_pipeline_442ce464 | RUNNING | 172.31.7.253:234 | 0.00357978 | 105.86 | 0.762005 | 0.0382961 | 2 | 2763.07 | 4.42392 | 3.65554 | 38.6883 |
| ray_train_pipeline_442d325c | RUNNING | 172.31.7.253:195 | 0.00324221 | 304.352 | 0.864147 | 0.00886753 | 2 | 2773.79 | 4.43879 | 3.80846 | 45.0807 |
| ray_train_pipeline_d3dfbe88 | RUNNING | 172.31.78.33:1209 | 0.000376895 | 162.128 | 0.850301 | 0.000227598 | 1 | 1480.65 | 8.6606 | 4.43245 | 84.1369 |
| ray_train_pipeline_4a350630 | RUNNING | | 1.76664e-05 | 276.471 | 0.972833 | 0.0523183 | | | | | |
| ray_train_pipeline_4a18b6c8 | RUNNING | | 0.0680077 | 151.236 | 0.933122 | 0.00472033 | | | | | |
| ray_train_pipeline_808046e0 | RUNNING | | 0.0137273 | 177.625 | 0.814604 | 0.0137248 | | | | | |
| ray_train_pipeline_b6a9dace | RUNNING | | 0.0234356 | 160.888 | 0.829874 | 0.000525914 | | | | | |
| ray_train_pipeline_23e99822 | RUNNING | | 0.050509 | 92.9954 | 0.74319 | 0.000680771 | | | | | |
| ray_train_pipeline_442acde6 | PAUSED | | 0.000929089 | 66.627 | 0.807416 | 0.000112121 | 2 | 2384.94 | 4.36737 | 3.38675 | 29.5696 |
| ray_train_pipeline_442baa68 | PAUSED | | 0.0533016 | 214.393 | 0.85648 | 0.00971763 | 2 | 3078.3 | 11.9806 | 8.25782 | 3857.68 |
| ray_train_pipeline_442bfafe | PAUSED | | 0.000677805 | 130.29 | 0.854526 | 0.000234229 | 2 | 2808.53 | 4.94845 | 3.77527 | 43.6094 |
| ray_train_pipeline_442c4c84 | PAUSED | | 0.00442261 | 780.418 | 0.834471 | 0.0702199 | 2 | 2513.01 | 4.56404 | 3.74711 | 42.3985 |
| ray_train_pipeline_442d80cc | PAUSED | | 0.0235995 | 241.531 | 0.98323 | 0.000363078 | 2 | 2917.26 | 9.09916 | 8.01513 | 3026.4 |
| ray_train_pipeline_442dde32 | PAUSED | | 0.000310791 | 308.032 | 0.78468 | 0.0220901 | 2 | 2917.77 | 5.81892 | 5.22594 | 186.035 |
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
2021-03-01 22:40:46,023 INFO command_runner.py:356 -- Fetched IP: 172.31.7.253
2021-03-01 22:40:46,024 INFO log_timer.py:25 -- NodeUpdater: i-0be5a5283ceb78e4c: Got IP [LogTimer=104ms]
(pid=249, ip=172.31.12.38) 2021-03-01 22:40:47,860 INFO trainable.py:72 -- Checkpoint size is 1507372567 bytes
(autoscaler +1h18m3s) Resized to 104 CPUs, 13 GPUs.
2021-03-01 22:40:56,909 WARNING util.py:161 -- The `process_trial_save` operation took 11.151 s, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 11.9/60.0 GiB
Using HyperBand: num_stopped=0 total_brackets=6
Round #0:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=44.4%): {PAUSED: 4, RUNNING: 1}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=33.3%): {RUNNING: 3}
Round #1:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=27.8%): {PAUSED: 2, RUNNING: 3}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=0.0%): {RUNNING: 3}
Resources requested: 10.0/104 CPUs, 10.0/13 GPUs, 0.0/539.21 GiB heap, 0.0/161.13 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: 442acde6 with valid_loss=3.3867452144622803 and parameters={'gpt2_trainer': {'params': {'learner': {'base_lr': 0.000929089076847593, 'wd': 0.00011212077937010874, 'mom': 0.8074159569205169, 'lr_mult': 66.62697819498219}}}}
Result logdir: /root/ray_results/jimt3-hparam-finder-2021-03-01-21-22
Number of trials: 16/48 (6 PAUSED, 10 RUNNING)
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
| Trial name | status | loc | gpt2_trainer/params/learner/base_lr | gpt2_trainer/params/learner/lr_mult | gpt2_trainer/params/learner/mom | gpt2_trainer/params/learner/wd | iter | total time (s) | train_loss | valid_loss | perplexity |
|-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------|
| ray_train_pipeline_442b5d24 | RUNNING | | 1.52398e-05 | 261.524 | 0.948861 | 0.0230175 | | | | | |
| ray_train_pipeline_442c984c | RUNNING | 172.31.12.38:248 | 0.000102905 | 148.35 | 0.716498 | 0.00411372 | 2 | 3206.56 | 7.68195 | 5.65236 | 284.964 |
| ray_train_pipeline_442ce464 | RUNNING | 172.31.7.253:234 | 0.00357978 | 105.86 | 0.762005 | 0.0382961 | 2 | 2763.07 | 4.42392 | 3.65554 | 38.6883 |
| ray_train_pipeline_442d325c | RUNNING | 172.31.7.253:195 | 0.00324221 | 304.352 | 0.864147 | 0.00886753 | 2 | 2773.79 | 4.43879 | 3.80846 | 45.0807 |
| ray_train_pipeline_d3dfbe88 | RUNNING | 172.31.78.33:1209 | 0.000376895 | 162.128 | 0.850301 | 0.000227598 | 1 | 1480.65 | 8.6606 | 4.43245 | 84.1369 |
| ray_train_pipeline_4a350630 | RUNNING | | 1.76664e-05 | 276.471 | 0.972833 | 0.0523183 | | | | | |
| ray_train_pipeline_4a18b6c8 | RUNNING | | 0.0680077 | 151.236 | 0.933122 | 0.00472033 | | | | | |
| ray_train_pipeline_808046e0 | RUNNING | | 0.0137273 | 177.625 | 0.814604 | 0.0137248 | | | | | |
| ray_train_pipeline_b6a9dace | RUNNING | | 0.0234356 | 160.888 | 0.829874 | 0.000525914 | | | | | |
| ray_train_pipeline_23e99822 | RUNNING | | 0.050509 | 92.9954 | 0.74319 | 0.000680771 | | | | | |
| ray_train_pipeline_442acde6 | PAUSED | | 0.000929089 | 66.627 | 0.807416 | 0.000112121 | 2 | 2384.94 | 4.36737 | 3.38675 | 29.5696 |
| ray_train_pipeline_442baa68 | PAUSED | | 0.0533016 | 214.393 | 0.85648 | 0.00971763 | 2 | 3078.3 | 11.9806 | 8.25782 | 3857.68 |
| ray_train_pipeline_442bfafe | PAUSED | | 0.000677805 | 130.29 | 0.854526 | 0.000234229 | 2 | 2808.53 | 4.94845 | 3.77527 | 43.6094 |
| ray_train_pipeline_442c4c84 | PAUSED | | 0.00442261 | 780.418 | 0.834471 | 0.0702199 | 2 | 2513.01 | 4.56404 | 3.74711 | 42.3985 |
| ray_train_pipeline_442d80cc | PAUSED | | 0.0235995 | 241.531 | 0.98323 | 0.000363078 | 2 | 2917.26 | 9.09916 | 8.01513 | 3026.4 |
| ray_train_pipeline_442dde32 | PAUSED | | 0.000310791 | 308.032 | 0.78468 | 0.0220901 | 2 | 2917.77 | 5.81892 | 5.22594 | 186.035 |
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
(pid=195, ip=172.31.7.253) [1, 4.4387922286987305, 3.808455228805542, 45.080745697021484, '28:46']
And here’s a snapshot of ray memory at that point in time (call-site column rewrapped onto one line per entry for readability):
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 7fabc23126d27f51d56d800cbde6ca1428bc4d7f0100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE d68fec326c8433c9c6953afc4a9f69e91488ca7c0100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 19f1295112af6a5642867781e3b6e074ed2613070100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 764f8475f05bf0014e2ab276f14c37c2f653b6c20100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 9e8c0eaa9bab673c5497aa04f981e4a162a3bc850100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE a56654db7f924b5611ac9524461283ce392476880100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 7d0a4228a9578dfee004d35e3a7c2cb2324fe6f20100000001000000
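Back-of-the-envelope math on what these retained references add up to. The per-object size and the count of seven entries come straight from the dump above; the interpretation that these are paused-trial checkpoints held by the driver is my reading of the save/pause_trial call sites:

```python
# Each LOCAL_REFERENCE above is a ~1.5 GB object created via
# ray_trial_executor.py:save during pause_trial, held by the driver (pid 1169).
checkpoint_bytes = 1507372570  # size reported per object in the dump
num_refs = 7                   # number of LOCAL_REFERENCE entries shown

pinned_gib = checkpoint_bytes * num_refs / 2**30
print(f"{pinned_gib:.1f} GiB pinned in the object store")  # ~9.8 GiB
```

While the driver holds these references, the object store cannot reclaim that memory, which is consistent with the steady growth seen as more trials get paused.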