Hi,
I’m running Ray Tune for a fairly large reinforcement learning project, and I’m confident my environment/agent/model work correctly. However, CPU usage always drops to about 15% for the last two remaining processes, which never finish.
I am working on a Mac M1 Max.
I tried adjusting the process priority with `renice` (e.g. `sudo renice -n -10 -p 18510 18513`), but it does not increase CPU usage. As an example, I get the following output:
== Status ==
Current time: 2022-08-26 08:11:22 (running for 08:56:56.14)
Memory usage on this node: 19.0/32.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/10 CPUs, 0/0 GPUs, 0.0/16.53 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/dominikrichard/ray_results/train_eval_2022-08-25_23-14-24
Number of trials: 120/120 (2 RUNNING, 118 TERMINATED)
+------------------------+------------+-----------------+---------+-------------+-------------+------------------------+--------+------------------+---------+
| Trial name | status | loc | lr | optimizer | pre_train | target_update_int… | iter | total time (s) | score |
|------------------------+------------+-----------------+---------+-------------+-------------+------------------------+--------+------------------+---------|
| train_eval_30b29_00006 | RUNNING | 127.0.0.1:18510 | 0.00025 | Adam | True | 1000 | | | |
| train_eval_30b29_00029 | RUNNING | 127.0.0.1:18513 | 5e-05 | RMS | True | 2500 | | | |
| train_eval_30b29_00000 | TERMINATED | 127.0.0.1:18497 | 0.00025 | Adam | True | 500 | 1 | 1382.33 | 13 |
| train_eval_30b29_00001 | TERMINATED | 127.0.0.1:18505 | 0.0001 | Adam | True | 500 | 1 | 1381.13 | 13 |
| train_eval_30b29_00002 | TERMINATED | 127.0.0.1:18506 | 5e-05 | Adam | True | 500 | 1 | 1364 | -68 |
| train_eval_30b29_00003 | TERMINATED | 127.0.0.1:18507 | 0.00025 | RMS | True | 500 | 1 | 1453.57 | 13 |
| train_eval_30b29_00004 | TERMINATED | 127.0.0.1:18508 | 0.0001 | RMS | True | 500 | 1 | 1443.43 | 10 |
| train_eval_30b29_00005 | TERMINATED | 127.0.0.1:18509 | 5e-05 | RMS | True | 500 | 1 | 1441.61 | 13 |
| train_eval_30b29_00007 | TERMINATED | 127.0.0.1:18511 | 0.0001 | Adam | True | 1000 | 1 | 1369.41 | 11 |
| train_eval_30b29_00008 | TERMINATED | 127.0.0.1:18513 | 5e-05 | Adam | True | 1000 | 1 | 1380.97 | -1 |
| train_eval_30b29_00009 | TERMINATED | 127.0.0.1:18514 | 0.00025 | RMS | True | 1000 | 1 | 1464.87 | 13 |
| train_eval_30b29_00010 | TERMINATED | 127.0.0.1:18506 | 0.0001 | RMS | True | 1000 | 1 | 1451.38 | 10 |
| train_eval_30b29_00011 | TERMINATED | 127.0.0.1:18511 | 5e-05 | RMS | True | 1000 | 1 | 1437.76 | 3 |
| train_eval_30b29_00012 | TERMINATED | 127.0.0.1:18497 | 0.00025 | Adam | True | 1500 | 1 | 1389.67 | 13 |
| train_eval_30b29_00013 | TERMINATED | 127.0.0.1:18513 | 0.0001 | Adam | True | 1500 | 1 | 1382.79 | 13 |
| train_eval_30b29_00014 | TERMINATED | 127.0.0.1:18505 | 5e-05 | Adam | True | 1500 | 1 | 1384.03 | 10 |
| train_eval_30b29_00015 | TERMINATED | 127.0.0.1:18509 | 0.00025 | RMS | True | 1500 | 1 | 1447.43 | 13 |
| train_eval_30b29_00016 | TERMINATED | 127.0.0.1:18508 | 0.0001 | RMS | True | 1500 | 1 | 1452.03 | 9 |
| train_eval_30b29_00017 | TERMINATED | 127.0.0.1:18507 | 5e-05 | RMS | True | 1500 | 1 | 1427.8 | 10 |
| train_eval_30b29_00018 | TERMINATED | 127.0.0.1:18514 | 0.00025 | Adam | True | 2000 | 1 | 1390.14 | 13 |
+------------------------+------------+-----------------+---------+-------------+-------------+------------------------+--------+------------------+---------+
… 100 more trials not shown (100 TERMINATED)
You can see that only two processes are not terminated, and my `top` window shows the following:
ID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PGRP
18513 python3.10 16.5 02:10:47 56 1 151 678M 160K 417M 18475
18510 python3.10 16.5 73:45.21 56 1 147 455M 160K 202M 184
This is the code I use for the Tuner:
from ray import tune


def main():
    """Launch the tuner."""
    search_space = {
        "lr": tune.grid_search([0.00025, 0.0001, 5e-05]),
        "optimizer": tune.grid_search(["Adam", "RMS"]),
        "pre_train": tune.grid_search([True]),
        "target_update_interval": tune.grid_search([i * 500 for i in range(1, 21)]),
    }
    tuner = tune.Tuner(
        trainable=train_eval,
        param_space=search_space,
    )
    results = tuner.fit()
    print(results.get_best_result(metric="score", mode="min").config)
where `train_eval` is my RL training function.
My questions are: why does it slow down so much on the last processes, and how do I see the results of the other 100 trials?
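For what it’s worth, this is the helper I’ve been using to read the per-trial results back from the logdir files directly (`collect_final_results` is my own function, and I’m assuming each trial directory contains a newline-delimited `result.json`, which seems to be what Tune writes by default; please correct me if there’s a proper API for this):

```python
import json
from pathlib import Path


def collect_final_results(logdir):
    """Return the last reported result for every trial under logdir.

    Assumes each trial directory contains a newline-delimited
    result.json file, one JSON object per reported iteration.
    """
    finals = {}
    for result_file in Path(logdir).glob("*/result.json"):
        lines = result_file.read_text().strip().splitlines()
        if lines:
            # The last line holds the trial's most recent report.
            finals[result_file.parent.name] = json.loads(lines[-1])
    return finals
```

So for the run above I would call it with `collect_final_results("/Users/dominikrichard/ray_results/train_eval_2022-08-25_23-14-24")` and sort by the `score` key.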
Thanks