Here’s a snapshot of stdout captured at the point where a trial is paused, followed by a snapshot of ray memory output:
Result for ray_train_pipeline_442bfafe:
date: 2021-03-01_22-30-54
done: false
experiment_id: 43ce1c6a79874822a62a93a862f5f0c5
hostname: ip-172-31-12-38.ec2.internal
iterations_since_restore: 2
node_ip: 172.31.12.38
perplexity: 43.609405517578125
pid: 249
should_checkpoint: true
time_since_restore: 2808.534305334091
time_this_iter_s: 1513.9927928447723
time_total_s: 2808.534305334091
timestamp: 1614637854
timesteps_since_restore: 0
train_loss: 4.948446750640869
training_iteration: 2
trial_id: 442bfafe
valid_loss: 3.775272846221924
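As a quick sanity check on the metrics above (my reading of the numbers, not something the logs state explicitly), the reported perplexity appears to simply be exp(valid_loss):

```python
import math

# Values copied from the trial result above.
valid_loss = 3.775272846221924
perplexity = 43.609405517578125

# perplexity matches exp(valid_loss) to within float32 rounding.
print(math.exp(valid_loss))  # ~43.609
assert abs(math.exp(valid_loss) - perplexity) < 1e-3
```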
2021-03-01 22:39:13,695 INFO command_runner.py:356 -- Fetched IP: 172.31.12.38
2021-03-01 22:39:13,695 INFO log_timer.py:25 -- NodeUpdater: i-0a009556544ed34ba: Got IP [LogTimer=92ms]
(pid=235, ip=172.31.7.253) 2021-03-01 22:39:15,228 INFO trainable.py:72 -- Checkpoint size is 1507372567 bytes
(autoscaler +1h16m32s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h16m38s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h16m44s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h16m50s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h16m55s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h17m7s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h17m12s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h17m18s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h17m35s) Resized to 72 CPUs, 9 GPUs.
(autoscaler +1h17m41s) Resized to 8 CPUs, 1 GPUs.
(autoscaler +1h17m46s) Resized to 72 CPUs, 9 GPUs.
2021-03-01 22:40:36,360 WARNING util.py:161 -- The `callbacks.on_trial_result` operation took 83.052 s, which may be a performance bottleneck.
2021-03-01 22:40:36,361 WARNING util.py:161 -- The `process_trial_result` operation took 83.054 s, which may be a performance bottleneck.
2021-03-01 22:40:36,361 WARNING util.py:161 -- Processing trial results took 83.054 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
2021-03-01 22:40:36,361 WARNING util.py:161 -- The `process_trial` operation took 83.055 s, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 12.0/60.0 GiB
Using HyperBand: num_stopped=0 total_brackets=6
Round #0:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=44.4%): {PAUSED: 3, RUNNING: 2}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=33.3%): {RUNNING: 3}
Round #1:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=27.8%): {PAUSED: 2, RUNNING: 3}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=0.0%): {RUNNING: 2}
Resources requested: 10.0/104 CPUs, 10.0/13 GPUs, 0.0/539.21 GiB heap, 0.0/161.13 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: 442acde6 with valid_loss=3.3867452144622803 and parameters={'gpt2_trainer': {'params': {'learner': {'base_lr': 0.000929089076847593, 'wd': 0.00011212077937010874, 'mom': 0.8074159569205169, 'lr_mult': 66.62697819498219}}}}
Result logdir: /root/ray_results/jimt3-hparam-finder-2021-03-01-21-22
Number of trials: 15/48 (5 PAUSED, 10 RUNNING)
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
| Trial name | status | loc | gpt2_trainer/params/learner/base_lr | gpt2_trainer/params/learner/lr_mult | gpt2_trainer/params/learner/mom | gpt2_trainer/params/learner/wd | iter | total time (s) | train_loss | valid_loss | perplexity |
|-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------|
| ray_train_pipeline_442b5d24 | RUNNING | | 1.52398e-05 | 261.524 | 0.948861 | 0.0230175 | | | | | |
| ray_train_pipeline_442bfafe | RUNNING | 172.31.12.38:249 | 0.000677805 | 130.29 | 0.854526 | 0.000234229 | 2 | 2808.53 | 4.94845 | 3.77527 | 43.6094 |
| ray_train_pipeline_442c984c | RUNNING | 172.31.12.38:248 | 0.000102905 | 148.35 | 0.716498 | 0.00411372 | 2 | 3206.56 | 7.68195 | 5.65236 | 284.964 |
| ray_train_pipeline_442ce464 | RUNNING | 172.31.7.253:234 | 0.00357978 | 105.86 | 0.762005 | 0.0382961 | 2 | 2763.07 | 4.42392 | 3.65554 | 38.6883 |
| ray_train_pipeline_442d325c | RUNNING | 172.31.7.253:195 | 0.00324221 | 304.352 | 0.864147 | 0.00886753 | 2 | 2773.79 | 4.43879 | 3.80846 | 45.0807 |
| ray_train_pipeline_d3dfbe88 | RUNNING | 172.31.78.33:1209 | 0.000376895 | 162.128 | 0.850301 | 0.000227598 | 1 | 1480.65 | 8.6606 | 4.43245 | 84.1369 |
| ray_train_pipeline_4a350630 | RUNNING | | 1.76664e-05 | 276.471 | 0.972833 | 0.0523183 | | | | | |
| ray_train_pipeline_4a18b6c8 | RUNNING | | 0.0680077 | 151.236 | 0.933122 | 0.00472033 | | | | | |
| ray_train_pipeline_808046e0 | RUNNING | | 0.0137273 | 177.625 | 0.814604 | 0.0137248 | | | | | |
| ray_train_pipeline_b6a9dace | RUNNING | | 0.0234356 | 160.888 | 0.829874 | 0.000525914 | | | | | |
| ray_train_pipeline_442acde6 | PAUSED | | 0.000929089 | 66.627 | 0.807416 | 0.000112121 | 2 | 2384.94 | 4.36737 | 3.38675 | 29.5696 |
| ray_train_pipeline_442baa68 | PAUSED | | 0.0533016 | 214.393 | 0.85648 | 0.00971763 | 2 | 3078.3 | 11.9806 | 8.25782 | 3857.68 |
| ray_train_pipeline_442c4c84 | PAUSED | | 0.00442261 | 780.418 | 0.834471 | 0.0702199 | 2 | 2513.01 | 4.56404 | 3.74711 | 42.3985 |
| ray_train_pipeline_442d80cc | PAUSED | | 0.0235995 | 241.531 | 0.98323 | 0.000363078 | 2 | 2917.26 | 9.09916 | 8.01513 | 3026.4 |
| ray_train_pipeline_442dde32 | PAUSED | | 0.000310791 | 308.032 | 0.78468 | 0.0220901 | 2 | 2917.77 | 5.81892 | 5.22594 | 186.035 |
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
2021-03-01 22:40:36,390 WARNING ray_trial_executor.py:658 -- Over the last 60 seconds, the Tune event loop has been backlogged processing new results. Consider increasing your period of result reporting to improve performance.
2021-03-01 22:40:36,663 INFO command_runner.py:356 -- Fetched IP: 172.31.12.38
2021-03-01 22:40:36,663 INFO log_timer.py:25 -- NodeUpdater: i-0a009556544ed34ba: Got IP [LogTimer=85ms]
(autoscaler +1h17m52s) Resized to 40 CPUs, 5 GPUs.
(autoscaler +1h17m58s) Resized to 8 CPUs, 1 GPUs.
2021-03-01 22:40:45,470 WARNING util.py:161 -- The `process_trial_save` operation took 9.079 s, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 12.2/60.0 GiB
Using HyperBand: num_stopped=0 total_brackets=6
Round #0:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=44.4%): {PAUSED: 4, RUNNING: 1}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=33.3%): {RUNNING: 3}
Round #1:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=27.8%): {PAUSED: 2, RUNNING: 3}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=0.0%): {RUNNING: 3}
Resources requested: 10.0/104 CPUs, 10.0/13 GPUs, 0.0/539.21 GiB heap, 0.0/161.13 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: 442acde6 with valid_loss=3.3867452144622803 and parameters={'gpt2_trainer': {'params': {'learner': {'base_lr': 0.000929089076847593, 'wd': 0.00011212077937010874, 'mom': 0.8074159569205169, 'lr_mult': 66.62697819498219}}}}
Result logdir: /root/ray_results/jimt3-hparam-finder-2021-03-01-21-22
Number of trials: 16/48 (6 PAUSED, 10 RUNNING)
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
| Trial name | status | loc | gpt2_trainer/params/learner/base_lr | gpt2_trainer/params/learner/lr_mult | gpt2_trainer/params/learner/mom | gpt2_trainer/params/learner/wd | iter | total time (s) | train_loss | valid_loss | perplexity |
|-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------|
| ray_train_pipeline_442b5d24 | RUNNING | | 1.52398e-05 | 261.524 | 0.948861 | 0.0230175 | | | | | |
| ray_train_pipeline_442c984c | RUNNING | 172.31.12.38:248 | 0.000102905 | 148.35 | 0.716498 | 0.00411372 | 2 | 3206.56 | 7.68195 | 5.65236 | 284.964 |
| ray_train_pipeline_442ce464 | RUNNING | 172.31.7.253:234 | 0.00357978 | 105.86 | 0.762005 | 0.0382961 | 2 | 2763.07 | 4.42392 | 3.65554 | 38.6883 |
| ray_train_pipeline_442d325c | RUNNING | 172.31.7.253:195 | 0.00324221 | 304.352 | 0.864147 | 0.00886753 | 2 | 2773.79 | 4.43879 | 3.80846 | 45.0807 |
| ray_train_pipeline_d3dfbe88 | RUNNING | 172.31.78.33:1209 | 0.000376895 | 162.128 | 0.850301 | 0.000227598 | 1 | 1480.65 | 8.6606 | 4.43245 | 84.1369 |
| ray_train_pipeline_4a350630 | RUNNING | | 1.76664e-05 | 276.471 | 0.972833 | 0.0523183 | | | | | |
| ray_train_pipeline_4a18b6c8 | RUNNING | | 0.0680077 | 151.236 | 0.933122 | 0.00472033 | | | | | |
| ray_train_pipeline_808046e0 | RUNNING | | 0.0137273 | 177.625 | 0.814604 | 0.0137248 | | | | | |
| ray_train_pipeline_b6a9dace | RUNNING | | 0.0234356 | 160.888 | 0.829874 | 0.000525914 | | | | | |
| ray_train_pipeline_23e99822 | RUNNING | | 0.050509 | 92.9954 | 0.74319 | 0.000680771 | | | | | |
| ray_train_pipeline_442acde6 | PAUSED | | 0.000929089 | 66.627 | 0.807416 | 0.000112121 | 2 | 2384.94 | 4.36737 | 3.38675 | 29.5696 |
| ray_train_pipeline_442baa68 | PAUSED | | 0.0533016 | 214.393 | 0.85648 | 0.00971763 | 2 | 3078.3 | 11.9806 | 8.25782 | 3857.68 |
| ray_train_pipeline_442bfafe | PAUSED | | 0.000677805 | 130.29 | 0.854526 | 0.000234229 | 2 | 2808.53 | 4.94845 | 3.77527 | 43.6094 |
| ray_train_pipeline_442c4c84 | PAUSED | | 0.00442261 | 780.418 | 0.834471 | 0.0702199 | 2 | 2513.01 | 4.56404 | 3.74711 | 42.3985 |
| ray_train_pipeline_442d80cc | PAUSED | | 0.0235995 | 241.531 | 0.98323 | 0.000363078 | 2 | 2917.26 | 9.09916 | 8.01513 | 3026.4 |
| ray_train_pipeline_442dde32 | PAUSED | | 0.000310791 | 308.032 | 0.78468 | 0.0220901 | 2 | 2917.77 | 5.81892 | 5.22594 | 186.035 |
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
2021-03-01 22:40:46,023 INFO command_runner.py:356 -- Fetched IP: 172.31.7.253
2021-03-01 22:40:46,024 INFO log_timer.py:25 -- NodeUpdater: i-0be5a5283ceb78e4c: Got IP [LogTimer=104ms]
(pid=249, ip=172.31.12.38) 2021-03-01 22:40:47,860 INFO trainable.py:72 -- Checkpoint size is 1507372567 bytes
(autoscaler +1h18m3s) Resized to 104 CPUs, 13 GPUs.
2021-03-01 22:40:56,909 WARNING util.py:161 -- The `process_trial_save` operation took 11.151 s, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 11.9/60.0 GiB
Using HyperBand: num_stopped=0 total_brackets=6
Round #0:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=44.4%): {PAUSED: 4, RUNNING: 1}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=33.3%): {RUNNING: 3}
Round #1:
Bracket(Max Size (n)=5, Milestone (r)=2, completed=27.8%): {PAUSED: 2, RUNNING: 3}
Bracket(Max Size (n)=3, Milestone (r)=6, completed=0.0%): {RUNNING: 3}
Resources requested: 10.0/104 CPUs, 10.0/13 GPUs, 0.0/539.21 GiB heap, 0.0/161.13 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: 442acde6 with valid_loss=3.3867452144622803 and parameters={'gpt2_trainer': {'params': {'learner': {'base_lr': 0.000929089076847593, 'wd': 0.00011212077937010874, 'mom': 0.8074159569205169, 'lr_mult': 66.62697819498219}}}}
Result logdir: /root/ray_results/jimt3-hparam-finder-2021-03-01-21-22
Number of trials: 16/48 (6 PAUSED, 10 RUNNING)
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
| Trial name | status | loc | gpt2_trainer/params/learner/base_lr | gpt2_trainer/params/learner/lr_mult | gpt2_trainer/params/learner/mom | gpt2_trainer/params/learner/wd | iter | total time (s) | train_loss | valid_loss | perplexity |
|-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------|
| ray_train_pipeline_442b5d24 | RUNNING | | 1.52398e-05 | 261.524 | 0.948861 | 0.0230175 | | | | | |
| ray_train_pipeline_442c984c | RUNNING | 172.31.12.38:248 | 0.000102905 | 148.35 | 0.716498 | 0.00411372 | 2 | 3206.56 | 7.68195 | 5.65236 | 284.964 |
| ray_train_pipeline_442ce464 | RUNNING | 172.31.7.253:234 | 0.00357978 | 105.86 | 0.762005 | 0.0382961 | 2 | 2763.07 | 4.42392 | 3.65554 | 38.6883 |
| ray_train_pipeline_442d325c | RUNNING | 172.31.7.253:195 | 0.00324221 | 304.352 | 0.864147 | 0.00886753 | 2 | 2773.79 | 4.43879 | 3.80846 | 45.0807 |
| ray_train_pipeline_d3dfbe88 | RUNNING | 172.31.78.33:1209 | 0.000376895 | 162.128 | 0.850301 | 0.000227598 | 1 | 1480.65 | 8.6606 | 4.43245 | 84.1369 |
| ray_train_pipeline_4a350630 | RUNNING | | 1.76664e-05 | 276.471 | 0.972833 | 0.0523183 | | | | | |
| ray_train_pipeline_4a18b6c8 | RUNNING | | 0.0680077 | 151.236 | 0.933122 | 0.00472033 | | | | | |
| ray_train_pipeline_808046e0 | RUNNING | | 0.0137273 | 177.625 | 0.814604 | 0.0137248 | | | | | |
| ray_train_pipeline_b6a9dace | RUNNING | | 0.0234356 | 160.888 | 0.829874 | 0.000525914 | | | | | |
| ray_train_pipeline_23e99822 | RUNNING | | 0.050509 | 92.9954 | 0.74319 | 0.000680771 | | | | | |
| ray_train_pipeline_442acde6 | PAUSED | | 0.000929089 | 66.627 | 0.807416 | 0.000112121 | 2 | 2384.94 | 4.36737 | 3.38675 | 29.5696 |
| ray_train_pipeline_442baa68 | PAUSED | | 0.0533016 | 214.393 | 0.85648 | 0.00971763 | 2 | 3078.3 | 11.9806 | 8.25782 | 3857.68 |
| ray_train_pipeline_442bfafe | PAUSED | | 0.000677805 | 130.29 | 0.854526 | 0.000234229 | 2 | 2808.53 | 4.94845 | 3.77527 | 43.6094 |
| ray_train_pipeline_442c4c84 | PAUSED | | 0.00442261 | 780.418 | 0.834471 | 0.0702199 | 2 | 2513.01 | 4.56404 | 3.74711 | 42.3985 |
| ray_train_pipeline_442d80cc | PAUSED | | 0.0235995 | 241.531 | 0.98323 | 0.000363078 | 2 | 2917.26 | 9.09916 | 8.01513 | 3026.4 |
| ray_train_pipeline_442dde32 | PAUSED | | 0.000310791 | 308.032 | 0.78468 | 0.0220901 | 2 | 2917.77 | 5.81892 | 5.22594 | 186.035 |
+-----------------------------+----------+-------------------+---------------------------------------+---------------------------------------+-----------------------------------+----------------------------------+--------+------------------+--------------+--------------+--------------+
(pid=195, ip=172.31.7.253) [1, 4.4387922286987305, 3.808455228805542, 45.080745697021484, '28:46']
And here’s a snapshot of ray memory at that point in time (call-site column rewrapped onto one line per entry for readability):
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 7fabc23126d27f51d56d800cbde6ca1428bc4d7f0100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE d68fec326c8433c9c6953afc4a9f69e91488ca7c0100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 19f1295112af6a5642867781e3b6e074ed2613070100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 764f8475f05bf0014e2ab276f14c37c2f653b6c20100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 9e8c0eaa9bab673c5497aa04f981e4a162a3bc850100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE a56654db7f924b5611ac9524461283ce392476880100000001000000
172.31.78.33 1169 Driver (actor call) | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:save:909 | /opt/conda/lib/python3.8/site-packages/ray/tune/trial_executor.py:pause_trial:123 | /opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py:pause_trial:561 1507372570 B LOCAL_REFERENCE 7d0a4228a9578dfee004d35e3a7c2cb2324fe6f20100000001000000
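Back-of-the-envelope math on what these retained references add up to. The per-object size and the count of seven entries come straight from the dump above; the interpretation that these are paused-trial checkpoints held by the driver is my reading of the save/pause_trial call sites:

```python
# Each LOCAL_REFERENCE above is a ~1.5 GB object created via
# ray_trial_executor.py:save during pause_trial, held by the driver (pid 1169).
checkpoint_bytes = 1507372570  # size reported per object in the dump
num_refs = 7                   # number of LOCAL_REFERENCE entries shown

pinned_gib = checkpoint_bytes * num_refs / 2**30
print(f"{pinned_gib:.1f} GiB pinned in the object store")  # ~9.8 GiB
```

While the driver holds these references, the object store cannot reclaim that memory, which is consistent with the steady growth seen as more trials get paused.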