CPU usage drop for the last two runs. Process go sleeping

Hi,

I’m running Ray Tuner for a quite big reinforcement learning project, which I am sure that my environment/agent/model are working correctly. However, the CPU usage always drop to 15% for the last two process remaining which are never ending.

I am working on Mac M1 Max.

I tried adjust the priority with renice, but it does not increase the CPU usage. As an example, I get the following output;
== Status ==
Current time: 2022-08-26 08:11:22 (running for 08:56:56.14)
Memory usage on this node: 19.0/32.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/10 CPUs, 0/0 GPUs, 0.0/16.53 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/dominikrichard/ray_results/train_eval_2022-08-25_23-14-24
Number of trials: 120/120 (2 RUNNING, 118 TERMINATED)
±-----------------------±-----------±----------------±--------±------------±------------±-----------------------±-------±-----------------±--------+
| Trial name | status | loc | lr | optimizer | pre_train | target_update_int… | iter | total time (s) | score |
|------------------------±-----------±----------------±--------±------------±------------±-----------------------±-------±-----------------±--------|
| train_eval_30b29_00006 | RUNNING | 127.0.0.1:18510 | 0.00025 | Adam | True | 1000 | | | |
| train_eval_30b29_00029 | RUNNING | 127.0.0.1:18513 | 5e-05 | RMS | True | 2500 | | | |
| train_eval_30b29_00000 | TERMINATED | 127.0.0.1:18497 | 0.00025 | Adam | True | 500 | 1 | 1382.33 | 13 |
| train_eval_30b29_00001 | TERMINATED | 127.0.0.1:18505 | 0.0001 | Adam | True | 500 | 1 | 1381.13 | 13 |
| train_eval_30b29_00002 | TERMINATED | 127.0.0.1:18506 | 5e-05 | Adam | True | 500 | 1 | 1364 | -68 |
| train_eval_30b29_00003 | TERMINATED | 127.0.0.1:18507 | 0.00025 | RMS | True | 500 | 1 | 1453.57 | 13 |
| train_eval_30b29_00004 | TERMINATED | 127.0.0.1:18508 | 0.0001 | RMS | True | 500 | 1 | 1443.43 | 10 |
| train_eval_30b29_00005 | TERMINATED | 127.0.0.1:18509 | 5e-05 | RMS | True | 500 | 1 | 1441.61 | 13 |
| train_eval_30b29_00007 | TERMINATED | 127.0.0.1:18511 | 0.0001 | Adam | True | 1000 | 1 | 1369.41 | 11 |
| train_eval_30b29_00008 | TERMINATED | 127.0.0.1:18513 | 5e-05 | Adam | True | 1000 | 1 | 1380.97 | -1 |
| train_eval_30b29_00009 | TERMINATED | 127.0.0.1:18514 | 0.00025 | RMS | True | 1000 | 1 | 1464.87 | 13 |
| train_eval_30b29_00010 | TERMINATED | 127.0.0.1:18506 | 0.0001 | RMS | True | 1000 | 1 | 1451.38 | 10 |
| train_eval_30b29_00011 | TERMINATED | 127.0.0.1:18511 | 5e-05 | RMS | True | 1000 | 1 | 1437.76 | 3 |
| train_eval_30b29_00012 | TERMINATED | 127.0.0.1:18497 | 0.00025 | Adam | True | 1500 | 1 | 1389.67 | 13 |
| train_eval_30b29_00013 | TERMINATED | 127.0.0.1:18513 | 0.0001 | Adam | True | 1500 | 1 | 1382.79 | 13 |
| train_eval_30b29_00014 | TERMINATED | 127.0.0.1:18505 | 5e-05 | Adam | True | 1500 | 1 | 1384.03 | 10 |
| train_eval_30b29_00015 | TERMINATED | 127.0.0.1:18509 | 0.00025 | RMS | True | 1500 | 1 | 1447.43 | 13 |
| train_eval_30b29_00016 | TERMINATED | 127.0.0.1:18508 | 0.0001 | RMS | True | 1500 | 1 | 1452.03 | 9 |
| train_eval_30b29_00017 | TERMINATED | 127.0.0.1:18507 | 5e-05 | RMS | True | 1500 | 1 | 1427.8 | 10 |
| train_eval_30b29_00018 | TERMINATED | 127.0.0.1:18514 | 0.00025 | Adam | True | 2000 | 1 | 1390.14 | 13 |
±-----------------------±-----------±----------------±--------±------------±------------±-----------------------±-------±-----------------±--------+
… 100 more trials not shown (100 TERMINATED)

== Status ==
Current time: 2022-08-26 08:11:27 (running for 08:57:01.16)
Memory usage on this node: 19.1/32.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/10 CPUs, 0/0 GPUs, 0.0/16.53 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/dominikrichard/ray_results/train_eval_2022-08-25_23-14-24
Number of trials: 120/120 (2 RUNNING, 118 TERMINATED)
±-----------------------±-----------±----------------±--------±------------±------------±-----------------------±-------±-----------------±--------+
| Trial name | status | loc | lr | optimizer | pre_train | target_update_int… | iter | total time (s) | score |
|------------------------±-----------±----------------±--------±------------±------------±-----------------------±-------±-----------------±--------|
| train_eval_30b29_00006 | RUNNING | 127.0.0.1:18510 | 0.00025 | Adam | True | 1000 | | | |
| train_eval_30b29_00029 | RUNNING | 127.0.0.1:18513 | 5e-05 | RMS | True | 2500 | | | |
| train_eval_30b29_00000 | TERMINATED | 127.0.0.1:18497 | 0.00025 | Adam | True | 500 | 1 | 1382.33 | 13 |
| train_eval_30b29_00001 | TERMINATED | 127.0.0.1:18505 | 0.0001 | Adam | True | 500 | 1 | 1381.13 | 13 |
| train_eval_30b29_00002 | TERMINATED | 127.0.0.1:18506 | 5e-05 | Adam | True | 500 | 1 | 1364 | -68 |
| train_eval_30b29_00003 | TERMINATED | 127.0.0.1:18507 | 0.00025 | RMS | True | 500 | 1 | 1453.57 | 13 |
| train_eval_30b29_00004 | TERMINATED | 127.0.0.1:18508 | 0.0001 | RMS | True | 500 | 1 | 1443.43 | 10 |
| train_eval_30b29_00005 | TERMINATED | 127.0.0.1:18509 | 5e-05 | RMS | True | 500 | 1 | 1441.61 | 13 |
| train_eval_30b29_00007 | TERMINATED | 127.0.0.1:18511 | 0.0001 | Adam | True | 1000 | 1 | 1369.41 | 11 |
| train_eval_30b29_00008 | TERMINATED | 127.0.0.1:18513 | 5e-05 | Adam | True | 1000 | 1 | 1380.97 | -1 |
| train_eval_30b29_00009 | TERMINATED | 127.0.0.1:18514 | 0.00025 | RMS | True | 1000 | 1 | 1464.87 | 13 |
| train_eval_30b29_00010 | TERMINATED | 127.0.0.1:18506 | 0.0001 | RMS | True | 1000 | 1 | 1451.38 | 10 |
| train_eval_30b29_00011 | TERMINATED | 127.0.0.1:18511 | 5e-05 | RMS | True | 1000 | 1 | 1437.76 | 3 |
| train_eval_30b29_00012 | TERMINATED | 127.0.0.1:18497 | 0.00025 | Adam | True | 1500 | 1 | 1389.67 | 13 |
| train_eval_30b29_00013 | TERMINATED | 127.0.0.1:18513 | 0.0001 | Adam | True | 1500 | 1 | 1382.79 | 13 |
| train_eval_30b29_00014 | TERMINATED | 127.0.0.1:18505 | 5e-05 | Adam | True | 1500 | 1 | 1384.03 | 10 |
| train_eval_30b29_00015 | TERMINATED | 127.0.0.1:18509 | 0.00025 | RMS | True | 1500 | 1 | 1447.43 | 13 |
| train_eval_30b29_00016 | TERMINATED | 127.0.0.1:18508 | 0.0001 | RMS | True | 1500 | 1 | 1452.03 | 9 |
| train_eval_30b29_00017 | TERMINATED | 127.0.0.1:18507 | 5e-05 | RMS | True | 1500 | 1 | 1427.8 | 10 |
| train_eval_30b29_00018 | TERMINATED | 127.0.0.1:18514 | 0.00025 | Adam | True | 2000 | 1 | 1390.14 | 13 |
±-----------------------±-----------±----------------±--------±------------±------------±-----------------------±-------±-----------------±--------+
… 100 more trials not shown (100 TERMINATED)

You can see there’s only two process not terminated, and my top window gives me the following;
ID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PGRP
18513 python3.10 16.5 02:10:47 56 1 151 678M 160K 417M 18475
18510 python3.10 16.5 73:45.21 56 1 147 455M 160K 202M 184

This is the code i got for using Tuner;
def main():
“”“Launch the tuner
“””
search_space={
‘lr’:tune.grid_search([0.00025, 0.0001, ]),
‘optimizer’:tune.grid_search([‘Adam’,‘RMS’]),
‘pre_train’:tune.grid_search([True]),
‘target_update_interval’:tune.grid_search([i*500 for i in range(1,21)])
}
tuner=tune.Tuner(
trainable=train_eval,
param_space=search_space
)
results=tuner.fit()
print(results.get_best_result(metric=“score”, mode=“min”).config)

Where train_eval is my RL function.

My question is why is it slowing down so much on last process? How do I get to see the other 100 process results?

Thanks

One thing to try is running py-spy dump --pid <trainable_process_id> to see what’s going on, could it actually be stuck?

I am going to take a look at this asap.

Hey @matthewdeng , so this is the output I get from py-spy;

(TFVenv) dominikrichard@Dominiks-MacBook-Pro /Users % sudo py-spy dump --pid 2548
Process 2548: ray::ImplicitFunc.train()
Python v3.10.4 (/opt/homebrew/anaconda3/envs/TFVenv/bin/python3.10)

Thread 0x100E00580 (idle): “MainThread”
wait (threading.py:324)
get (queue.py:180)
step (site-packages/ray/tune/trainable/function_trainable.py:398)
_resume_span (site-packages/ray/util/tracing/tracing_helper.py:466)
train (site-packages/ray/tune/trainable/trainable.py:347)
_resume_span (site-packages/ray/util/tracing/tracing_helper.py:466)
actor_method_executor (site-packages/ray/_private/function_manager.py:674)
main_loop (site-packages/ray/_private/worker.py:756)
(site-packages/ray/_private/workers/default_worker.py:237)
Thread 0x170E03000 (idle): “ray_import_thread”
wait (threading.py:324)
_wait_once (site-packages/grpc/_common.py:106)
wait (site-packages/grpc/_common.py:148)
result (site-packages/grpc/_channel.py:733)
_poll_locked (site-packages/ray/_private/gcs_pubsub.py:249)
poll (site-packages/ray/_private/gcs_pubsub.py:385)
_run (site-packages/ray/_private/import_thread.py:70)
run (threading.py:946)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)
Thread 0x355807000 (idle): “Thread-56”
call (site-packages/tensorflow/python/client/session.py:1480)
call (site-packages/keras/backend.py:4284)
train_on_batch (site-packages/keras/engine/training_v1.py:1075)
backward (site-packages/rl/agents/dqn.py:321)
fit (site-packages/rl/core.py:193)
train_eval (keras_dpss_tune.py:173)
_trainable_func (site-packages/ray/tune/trainable/function_trainable.py:684)
_resume_span (site-packages/ray/util/tracing/tracing_helper.py:466)
entrypoint (site-packages/ray/tune/trainable/function_trainable.py:362)
run (site-packages/ray/tune/trainable/function_trainable.py:289)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)
Thread 0x171E0F000 (idle): “Thread-74 (_run)”
channel_spin (site-packages/grpc/_channel.py:1258)
run (threading.py:946)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)

Is it hanging here? What version of TF are you using?

How to know if it is hanging there? @matthewdeng
Here is my pip freeze; I am using the latest version of tf-macos with tf-metal
absl-py==1.2.0

aiosignal==1.2.0

ale-py==0.7.5

appnope @ file:///home/conda/feedstock_root/build_artifacts/appnope_1649077682618/work

argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1640817743617/work

argon2-cffi-bindings @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/croot-wbf5edig/argon2-cffi-bindings_1644845754377/work

asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1618968359944/work

astunparse==1.6.3

attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1659291887007/work

backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work

backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work

beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1649463573192/work

bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1656355450470/work

Bottleneck @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_07078715-3ab7-4562-8d3d-d56b0eaa0f7dp504n_ny/croots/recipe/bottleneck_1657175566567/work

cachetools==5.2.0

certifi @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_884c889c-96af-444f-bd6d-daddb5e9a462ykj3l5n_/croots/recipe/certifi_1655968814730/work/certifi

cffi @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_f9af_qffaa/croots/recipe/cffi_1659598650277/work

charset-normalizer==2.1.0

click==8.0.4

cloudpickle==2.1.0

cycler==0.11.0

debugpy @ file:///Users/ktietz/ci_310/debugpy_1643965577625/work

decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work

defusedxml @ file:///home/conda/feedstock_root/build_artifacts/defusedxml_1615232257335/work

distlib==0.3.5

dm-tree==0.1.7

entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work

executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1658852325129/work

fastjsonschema @ file:///home/conda/feedstock_root/build_artifacts/python-fastjsonschema_1658064924516/work/dist

filelock==3.8.0

flatbuffers==1.12

flit_core @ file:///home/conda/feedstock_root/build_artifacts/flit-core_1645629044586/work/source/flit_core

fonttools==4.34.4

frozenlist==1.3.1

gast==0.4.0

gin-config==0.5.0

google-auth==2.9.1

google-auth-oauthlib==0.4.6

google-pasta==0.2.0

grpcio @ file:///Users/ktietz/ci_310/grpcio_1643970354199/work

gym==0.22.0

gym-notices==0.0.7

h5py @ file:///Users/ktietz/ci_310/h5py_1644262287129/work

idna==3.3

igraph==0.9.11

importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1607953814822/work

importlib-resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1658604161399/work

ipykernel @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_croot-4yxj69n0/ipykernel_1647009452031/work/dist/ipykernel-6.9.1-py3-none-any.whl

ipython @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_df384d36-235b-4a10-98f5-52bceb3ceae46obihg1k/croots/recipe/ipython_1657652224441/work

ipython-genutils==0.2.0

ipywidgets @ file:///home/conda/feedstock_root/build_artifacts/ipywidgets_1655973868664/work

jedi @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/croot-f1t6hma6/jedi_1644315882177/work

Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1654302431367/work

joblib==1.1.0

jsonschema @ file:///home/conda/feedstock_root/build_artifacts/jsonschema-meta_1659525086692/work

jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1654730843242/work

jupyter-console @ file:///home/conda/feedstock_root/build_artifacts/jupyter_console_1655961255101/work

jupyter-core @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_5839c60e-0f30-4a99-b0b9-2b009ec5fec40y9fo9ym/croots/recipe/jupyter_core_1651671230982/work

jupyterlab-pygments @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_pygments_1649936611996/work

jupyterlab-widgets @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_widgets_1655961217661/work

keras==2.9.0

Keras-Preprocessing==1.1.2

keras-rl2==1.0.5

kiwisolver==1.4.4

libclang==14.0.1

llvmlite==0.39.0

Markdown==3.4.1

MarkupSafe @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_12c133f5-0720-4727-9c18-599a3af825723lzwham3/croots/recipe/markupsafe_1654597866058/work

matplotlib==3.5.2

matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1631080358261/work

mistune==0.8.3

msgpack==1.0.4

msgpack-numpy==0.4.8

nbclient @ file:///home/conda/feedstock_root/build_artifacts/nbclient_1656688109017/work

nbconvert @ file:///home/conda/feedstock_root/build_artifacts/nbconvert-meta_1649676641343/work

nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1651607001005/work

nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1648959695634/work

networkx==2.8.5

notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1654636967533/work

numba==0.56.0

numexpr @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_08e1f2b1-735c-4635-9755-5afc6d3eb18ew8ilpa0t/croots/recipe/numexpr_1656940301129/work

numpy @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_ba5c3eaa-be93-4f42-ae24-fa720a3c6627m_sp7nbj/croots/recipe/numpy_and_numpy_base_1652801688716/work

oauthlib==3.2.0

opencv-python==4.6.0.66

opt-einsum==3.3.0

packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1637239678211/work

pandas==1.4.3

pandocfilters @ file:///home/conda/feedstock_root/build_artifacts/pandocfilters_1631603243851/work

parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work

pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1602535608087/work

pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work

Pillow==9.2.0

pkgutil_resolve_name @ file:///home/conda/feedstock_root/build_artifacts/pkgutil-resolve-name_1633981968097/work

platformdirs==2.5.2

ply==3.11

preconditions==0.1

prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1649447152425/work

prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1656332401605/work

protobuf==3.19.4

ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl

pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work

py-spy==0.3.12

pyasn1==0.4.8

pyasn1-modules==0.2.8

pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work

pygame==2.1.0

Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1650904496387/work

pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1652235407899/work

PyQt5-sip==12.11.0

pyrsistent @ file:///Users/ktietz/ci_310/pyrsistent_1643962172005/work

python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work

pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1647961439546/work

PyYAML==6.0

pyzmq @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_8599562e-e9e5-443b-91db-7f7c0ba6aad3mrdoyvz4/croots/recipe/pyzmq_1657724196154/work

qtconsole @ file:///home/conda/feedstock_root/build_artifacts/qtconsole-base_1654465674650/work

QtPy @ file:///home/conda/feedstock_root/build_artifacts/qtpy_1651524593465/work

ray==2.0.0

requests==2.28.1

requests-oauthlib==1.3.1

reverb==2.0.1

rsa==4.9

scipy==1.9.0

seaborn==0.11.2

Send2Trash @ file:///home/conda/feedstock_root/build_artifacts/send2trash_1628511208346/work

sip @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_fbqiv4bzwo/croots/recipe/sip_1659012372184/work

six==1.15.0

soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1658207591808/work

stable-baselines==2.10.2

stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1655315839047/work

tabulate==0.8.10

tensorboard==2.9.1

tensorboard-data-server==0.6.1

tensorboard-plugin-wit==1.8.1

tensorboardX==2.5.1

tensorflow-estimator==2.9.0

tensorflow-macos==2.9.2

tensorflow-metal==0.5.0

tensorflow-probability==0.17.0

Tensorforce==0.6.5

termcolor==1.1.0

terminado @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/croot-n1puqh63/terminado_1644395131327/work

texttable==1.6.4

tf-agents==0.13.0

tinycss2 @ file:///home/conda/feedstock_root/build_artifacts/tinycss2_1637612658783/work

toml @ file:///home/conda/feedstock_root/build_artifacts/toml_1604308577558/work

tornado @ file:///Users/ktietz/ci_310/tornado_1643969120498/work

tqdm==4.64.0

traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1655411388954/work

typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1656706066251/work

urllib3==1.26.11

virtualenv==20.16.3

wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1600965781394/work

webencodings==0.5.1

Werkzeug==2.2.0

widgetsnbextension @ file:///home/conda/feedstock_root/build_artifacts/widgetsnbextension_1655939017940/work

wrapt==1.14.1

zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1659400682470/work

Hm if you call py-spy multiple times and it shows the same trace, and it’s at a line of code that isn’t expect to take a long time, then there’s a high chance it’s hanging.

For the output trace, maybe there’s something causing the TensorFlow session to hang…