It took about 13 minutes to schedule all trials. From the logs:
==> at the beginning
2023-01-19 16:43:04,573 WARNING trial_runner.py:297 - The maximum number of pending trials has been automatically set to the number of available cluster CPUs, which is high (1075 CPUs/pending trials). If you're running an experiment with a large number of trials, this could lead to scheduling overhead. In this case, consider setting the TUNE_MAX_PENDING_TRIALS_PG environment variable to the desired maximum number of concurrent trials.
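Following that warning's suggestion, capping the number of pending trials might look roughly like the sketch below. This is only a sketch: the trainable, the search space, and the cap of 64 are placeholders I made up; only TUNE_MAX_PENDING_TRIALS_PG itself comes from the warning, and the 960 samples / 1 CPU per trial match the status output below.

```python
import os

# Per the warning above: cap how many pending trials Tune creates at once.
# Must be set before the Tune run starts. 64 is a placeholder value.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "64"

from ray import tune

def my_trainable(config):
    # Stand-in for the real trainable used in this experiment.
    tune.report(mean_accuracy=config["lr"] * config["some_other_factor"])

tune.run(
    my_trainable,
    name="pbt_test",
    num_samples=960,                 # matches the 960 trials in the logs
    resources_per_trial={"cpu": 1},  # 1 CPU per trial, as in the status output
    config={
        "lr": tune.choice([0.0001, 0.001, 0.01]),  # placeholder search space
        "some_other_factor": 1,
    },
)
```

If the installed Ray version supports it, passing max_concurrent_trials to tune.run may be a cleaner way to get the same effect, but I have not verified that here.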
== Status ==
Current time: 2023-01-19 16:43:31 (running for 00:00:26.87)
Memory usage on this node: 14.2/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/978 CPUs, 0/0 GPUs, 0.0/688.1 GiB heap, 0.0/296.26 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (959 PENDING, 1 RUNNING)
===> After some time
== Status ==
Current time: 2023-01-19 16:51:38 (running for 00:08:33.99)
Memory usage on this node: 15.5/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 33.0/914 CPUs, 0/0 GPUs, 0.0/643.5 GiB heap, 0.0/277.15 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (927 PENDING, 33 RUNNING)
==> and later on
== Status ==
Current time: 2023-01-19 16:58:05 (running for 00:15:00.52)
Memory usage on this node: 16.0/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 194.0/978 CPUs, 0/0 GPUs, 0.0/688.1 GiB heap, 0.0/296.26 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (766 PENDING, 194 RUNNING)
==> It did eventually get almost all trials running in parallel, but only after a long time:
== Status ==
Current time: 2023-01-19 16:59:54 (running for 00:16:50.37)
Memory usage on this node: 17.0/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 956.0/978 CPUs, 0/0 GPUs, 0.0/688.1 GiB heap, 0.0/296.26 GiB objects
Current best trial: 57e9b_00910 with mean_accuracy=8.646410925525906 and parameters={'lr': 0.0001, 'some_other_factor': 1}
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (4 PENDING, 956 RUNNING)
Error logs:
===> RPC errors for a few minutes while some trials were running
(raylet) [2023-01-19 16:44:35,313 E 80 80] (raylet) worker_pool.cc:502: Some workers of the worker process(10946) have not registered within the timeout. The process is still alive, probably it’s hanging during start.
(raylet) [2023-01-19 16:44:35,318 E 80 80] (raylet) worker_pool.cc:502: Some workers of the worker process(10961) have not registered within the timeout. The process is still alive, probably it’s hanging during start.
(pid=gcs_server) [2023-01-19 16:48:15,518 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:16,515 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:17,516 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:18,516 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:19,518 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:20,519 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(raylet) [2023-01-19 16:48:21,113 E 80 80] (raylet) client_connection.cc:318: Broken Pipe happened during calling ServerConnection::DoAsyncWrites.
Number of trials: 960/960 (942 PENDING, 18 RUNNING)
(pid=gcs_server) [2023-01-19 16:48:24,525 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:25,526 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
===> Now the above error is gone and we see Broken Pipe errors instead
== Status ==
Current time: 2023-01-19 16:52:09 (running for 00:09:04.51)
Memory usage on this node: 15.5/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 36.0/914 CPUs, 0/0 GPUs, 0.0/643.5 GiB heap, 0.0/277.15 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (924 PENDING, 36 RUNNING)
2023-01-19 16:52:09,390 WARNING worker.py:1404 – The node with node id: 664e78a4c6f48d1e26cce51553db6ede5d9e39fe72acb976b101a14c and address: 100.xx.xx.6 and node name: 100.xx.xx.6 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(raylet) [2023-01-19 16:52:10,588 E 80 80] (raylet) client_connection.cc:318: Broken Pipe happened during calling ServerConnection::DoAsyncWrites.
(raylet) [2023-01-19 16:52:10,627 E 80 80] (raylet) worker_pool.cc:502: Some workers of the worker process(12748) have not registered within the timeout. The process is still alive, probably it’s hanging during start.
2023-01-19 16:52:12,389 WARNING worker.py:1404 – The node with node id: bbb5ece21ad0a5d0791fbe6c2524b8a3126e7e42992c69e7e6e8b8f6 and address: 100.xx.xx.141 and node name: 100.xx.xx.141 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
2023-01-19 16:52:14,390 WARNING worker.py:1404 – The node with node id: f0e02255ae059672f5bbf0ae3399bb420277bb3a8e8d59a564ddf8ed and address: 100.xx.xx.243 and node name: 100.xx.xx.243 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(raylet) [2023-01-19 16:52:17,989 E 80 80] (raylet) client_connection.cc:318: Broken Pipe happened during calling ServerConnection::DoAsyncWrites.
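For the nodes being marked dead, I assume the failure-detector threshold could be relaxed when the head node is started; the exact system-config key depends on the Ray version (older releases use num_heartbeats_timeout, newer ones use health_check_* settings), so treat the sketch below as a guess rather than a verified fix.

```python
import ray

# Sketch only: give lagging raylets more time before they are marked dead
# during the trial start-up burst. The key name and value are assumptions for
# this Ray version, not verified; _system_config is only honored on the node
# that starts the cluster (the head node).
ray.init(
    _system_config={
        "num_heartbeats_timeout": 300,  # assumed key; the default is far lower
    },
)
```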
===> Last Broken Pipe error:
(raylet) [2023-01-19 16:54:26,907 E 80 80] (raylet) client_connection.cc:318: Broken Pipe happened during calling ServerConnection::DoAsyncWrites.
== Status ==
Current time: 2023-01-19 16:55:37 (running for 00:12:33.22)
Memory usage on this node: 15.6/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 37.0/914 CPUs, 0/0 GPUs, 0.0/643.5 GiB heap, 0.0/277.15 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (923 PENDING, 37 RUNNING)
===> While most trials were running, there was still this GOAWAY error:
(bundle_reservation_check_func pid=13828) E0119 16:59:21.122163883 14101 chttp2_transport.cc:1128] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
==> Eventually the frequent worker registration errors also went away; this was the last one:
(raylet) [2023-01-19 16:54:21,562 E 80 80] (raylet) worker_pool.cc:502: Some workers of the worker process(13452) have not registered within the timeout. The process is still alive, probably it’s hanging during start.
== Status ==
Current time: 2023-01-19 16:59:21 (running for 00:16:17.35)
Memory usage on this node: 16.6/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 776.0/978 CPUs, 0/0 GPUs, 0.0/688.1 GiB heap, 0.0/296.26 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (184 PENDING, 776 RUNNING)