Ray Tune PBT trials pending while resources are available

We have 960 CPUs available in the cluster and we are using 1 CPU per trial for the PBT algorithm (960 samples/trials). This config is just for testing.
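
For reference, each trial only requests a single CPU. A rough sketch of the shape of the call (placeholder trainable and config, not our actual code, which I can't share; the scheduler is omitted here for brevity):

from ray import tune

def my_trainable(config):
    # Placeholder training loop: just reports a dummy metric each step.
    for step in range(21):
        tune.report(mean_accuracy=config["lr"] * step)

analysis = tune.run(
    my_trainable,
    metric="mean_accuracy",
    mode="max",
    num_samples=960,                  # 960 trials
    resources_per_trial={"cpu": 1},   # 1 CPU requested per trial
    config={"lr": 0.0001},
)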

I have tried several times, but it never uses all CPUs, and most of the time fewer than 100 trials are running:

Resources requested: 32.0/976 CPUs, 0/0 GPUs, 0.0/2705.67 GiB heap, 0.0/1160.94 GiB objects
2023-01-10 18:23:13,281: Number of trials: 960/960 (944 PENDING, 16 RUNNING)

Can somebody please help me understand what is happening?

Hi @taqreez, can you share your full script, or a snippet of it, so we can reproduce this?

Sorry for the late reply, I was away.
I could not post the previous code I tested due to firm policy, so I have rerun this example on the same setup with the default (FIFO) scheduler: PBT Example — Ray 1.13.0

Setup:

  • Ray over Kubernetes (manual setup no autoscaler)
  • Ray version 1.13
  • Head Pod: 18 CPUs and 32 GB memory
  • 30 Worker Pods with 32 CPUs and 32 GB memory each (a quick resource sanity check is sketched below)
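
For what it's worth, a quick sanity check that the driver actually sees all of those CPUs (run from the head pod, attaching to the existing cluster):

import ray

ray.init(address="auto")          # connect to the running cluster
print(ray.cluster_resources())    # total resources the cluster reports
print(ray.available_resources())  # resources currently free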

Changes to the code:
analysis = tune.run(
    PBTBenchmarkExample,
    name="pbt_test",
    # scheduler=pbt,  # not using the PBT scheduler
    metric="mean_accuracy",
    mode="max",
    local_dir=...,  # path omitted
    sync_config=tune.SyncConfig(syncer=None),
    # reuse_actors=True,
    # checkpoint_freq=0,
    stop={
        "training_iteration": 21,
    },
    num_samples=960,
    config={
        "lr": 0.0001,
        # note: this parameter is perturbed but has no effect on
        # the model training in this example
        "some_other_factor": 1,
    },
    verbose=2,
)

Yes, I'm not using the PBT scheduler here, just testing resource use with the default (FIFO) scheduler.
I was expecting that all 960 trials corresponding to the 960 samples would run in parallel, since I have enough CPUs and memory.

It took about 13 minutes to schedule all trials:

Logs:
==> at the beginning
2023-01-19 16:43:04,573 WARNING trial_runner.py:297 – The maximum number of pending trials has been automatically set to the number of available cluster CPUs, which is high (1075 CPUs/pending trials). If you’re running an experiment with a large number of trials, this could lead to scheduling overhead. In this case, consider setting the TUNE_MAX_PENDING_TRIALS_PG environment variable to the desired maximum number of concurrent trials.
== Status ==
Current time: 2023-01-19 16:43:31 (running for 00:00:26.87)
Memory usage on this node: 14.2/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/978 CPUs, 0/0 GPUs, 0.0/688.1 GiB heap, 0.0/296.26 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (959 PENDING, 1 RUNNING)
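
(As the warning above suggests, the number of pending trials can be capped via the TUNE_MAX_PENDING_TRIALS_PG environment variable; something like the following before calling tune.run(), where 128 is just an arbitrary example value:

import os

# Cap how many trials Tune keeps pending at once; must be set
# before tune.run() is invoked. 128 is an arbitrary example value.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "128"
)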

===> After some time
== Status ==
Current time: 2023-01-19 16:51:38 (running for 00:08:33.99)
Memory usage on this node: 15.5/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 33.0/914 CPUs, 0/0 GPUs, 0.0/643.5 GiB heap, 0.0/277.15 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (927 PENDING, 33 RUNNING)

==> and later on
== Status ==
Current time: 2023-01-19 16:58:05 (running for 00:15:00.52)
Memory usage on this node: 16.0/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 194.0/978 CPUs, 0/0 GPUs, 0.0/688.1 GiB heap, 0.0/296.26 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (766 PENDING, 194 RUNNING)

==> It did reach almost all trials running in parallel, but only after a long time:
== Status ==
Current time: 2023-01-19 16:59:54 (running for 00:16:50.37)
Memory usage on this node: 17.0/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 956.0/978 CPUs, 0/0 GPUs, 0.0/688.1 GiB heap, 0.0/296.26 GiB objects
Current best trial: 57e9b_00910 with mean_accuracy=8.646410925525906 and parameters={‘lr’: 0.0001, ‘some_other_factor’: 1}
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (4 PENDING, 956 RUNNING)

Error Logs:

===> RPC errors for a few minutes while some trials were running
(raylet) [2023-01-19 16:44:35,313 E 80 80] (raylet) worker_pool.cc:502: Some workers of the worker process(10946) have not registered within the timeout. The process is still alive, probably it’s hanging during start.
(raylet) [2023-01-19 16:44:35,318 E 80 80] (raylet) worker_pool.cc:502: Some workers of the worker process(10961) have not registered within the timeout. The process is still alive, probably it’s hanging during start.

(pid=gcs_server) [2023-01-19 16:48:15,518 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:16,515 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:17,516 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:18,516 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:19,518 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:20,519 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(raylet) [2023-01-19 16:48:21,113 E 80 80] (raylet) client_connection.cc:318: Broken Pipe happened during calling ServerConnection::DoAsyncWrites.

Number of trials: 960/960 (942 PENDING, 18 RUNNING)

(pid=gcs_server) [2023-01-19 16:48:24,525 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
(pid=gcs_server) [2023-01-19 16:48:25,526 E 15 15] (gcs_server) gcs_server.cc:283: Failed to get the resource load: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:

===> Now the above error is gone and we see Broken Pipe errors

== Status ==
Current time: 2023-01-19 16:52:09 (running for 00:09:04.51)
Memory usage on this node: 15.5/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 36.0/914 CPUs, 0/0 GPUs, 0.0/643.5 GiB heap, 0.0/277.15 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (924 PENDING, 36 RUNNING)

2023-01-19 16:52:09,390 WARNING worker.py:1404 – The node with node id: 664e78a4c6f48d1e26cce51553db6ede5d9e39fe72acb976b101a14c and address: 100.xx.xx.6 and node name: 100.xx.xx.6 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(raylet) [2023-01-19 16:52:10,588 E 80 80] (raylet) client_connection.cc:318: Broken Pipe happened during calling ServerConnection::DoAsyncWrites.
(raylet) [2023-01-19 16:52:10,627 E 80 80] (raylet) worker_pool.cc:502: Some workers of the worker process(12748) have not registered within the timeout. The process is still alive, probably it’s hanging during start.
2023-01-19 16:52:12,389 WARNING worker.py:1404 – The node with node id: bbb5ece21ad0a5d0791fbe6c2524b8a3126e7e42992c69e7e6e8b8f6 and address: 100.xx.xx.141 and node name: 100.xx.xx.141 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
2023-01-19 16:52:14,390 WARNING worker.py:1404 – The node with node id: f0e02255ae059672f5bbf0ae3399bb420277bb3a8e8d59a564ddf8ed and address: 100.xx.xx.243 and node name: 100.xx.xx.243 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(raylet) [2023-01-19 16:52:17,989 E 80 80] (raylet) client_connection.cc:318: Broken Pipe happened during calling ServerConnection::DoAsyncWrites.
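
(Since the head marked several worker nodes dead here, a quick way to check which nodes it still considers alive is something like:

import ray

ray.init(address="auto")
# ray.nodes() returns one dict per node, including an "Alive" flag
# and the resources that node advertises.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node.get("Resources", {}))
)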

===> Last Broken Pipe error:
(raylet) [2023-01-19 16:54:26,907 E 80 80] (raylet) client_connection.cc:318: Broken Pipe happened during calling ServerConnection::DoAsyncWrites.
== Status ==
Current time: 2023-01-19 16:55:37 (running for 00:12:33.22)
Memory usage on this node: 15.6/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 37.0/914 CPUs, 0/0 GPUs, 0.0/643.5 GiB heap, 0.0/277.15 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (923 PENDING, 37 RUNNING)

===> While most trials were running, there was still this GOAWAY error

(bundle_reservation_check_func pid=13828) E0119 16:59:21.122163883 14101 chttp2_transport.cc:1128] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to “too_many_pings”

==> Now the frequent worker registration error also goes away (last occurrence):
(raylet) [2023-01-19 16:54:21,562 E 80 80] (raylet) worker_pool.cc:502: Some workers of the worker process(13452) have not registered within the timeout. The process is still alive, probably it’s hanging during start.

== Status ==
Current time: 2023-01-19 16:59:21 (running for 00:16:17.35)
Memory usage on this node: 16.6/373.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 776.0/978 CPUs, 0/0 GPUs, 0.0/688.1 GiB heap, 0.0/296.26 GiB objects
Result logdir: /var/kitetmp/checpoints/pbt_test
Number of trials: 960/960 (184 PENDING, 776 RUNNING)