Is Trial Scheduling sequential?

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It causes significant difficulty in completing my task, but I can work around it.
  • High: It blocks me from completing my task.

I have 100 CPUs and 100 trials with 1 CPU per trial. If each trial's Actor/trainable takes 60 seconds in its setup method, is it going to take almost 100 minutes to schedule all trials?

In my case, trials are moving to the RUNNING state one by one even though CPUs are available:

INFO:kite:2023-04-13 13:51:39,350: Current time: 2023-04-13 12:51:39 (running for 00:32:41.53)
INFO:kite:2023-04-13 13:51:39,350: Memory usage on this node: 6.9/62.0 GiB
INFO:kite:2023-04-13 13:51:39,350: PopulationBasedTraining: 0 checkpoints, 0 perturbs
INFO:kite:2023-04-13 13:51:39,350: Resources requested: 41.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
INFO:kite:2023-04-13 13:51:39,351: Result logdir: /var/kitetmp/checkpoints/kite-job-230413-131010-274165/output/RfqTrainable_2023-04-13_12-18-57
INFO:kite:2023-04-13 13:51:39,351: Number of trials: 96/96 (55 PENDING, 41 RUNNING)

INFO:kite:2023-04-13 13:52:30,451: Current time: 2023-04-13 12:52:29 (running for 00:33:31.86)
INFO:kite:2023-04-13 13:52:30,451: Memory usage on this node: 7.0/62.0 GiB
INFO:kite:2023-04-13 13:52:30,451: PopulationBasedTraining: 0 checkpoints, 0 perturbs
INFO:kite:2023-04-13 13:52:30,451: Resources requested: 42.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
INFO:kite:2023-04-13 13:52:30,451: Result logdir: /var/kitetmp/checkpoints/kite-job-230413-131010-274165/output/RfqTrainable_2023-04-13_12-18-57
INFO:kite:2023-04-13 13:52:30,452: Number of trials: 96/96 (54 PENDING, 42 RUNNING)

Hi @taqreez, generally the trials should all run in parallel, so the setup should - ideally - only take 60 seconds.

The trial status updates are handled sequentially (so the trials move one by one), but this should happen very fast.

So what you should see (with 2 trials):

t=0.00, start trial A
t=0.01, start trial B
... wait for setup
t=60.01 trial A starts training
t=60.01 trial B starts training

The .00 and .01 are just for illustration purposes.
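To make the expected behavior concrete, here is a small pure-Python simulation (not Ray code) of the timeline above: N trial setups launched in parallel should finish in roughly one setup duration, not N of them.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SETUP_SECONDS = 0.1  # stand-in for the real ~60 s Trainable.setup()
NUM_TRIALS = 8       # stand-in for the 96 trials in the logs

def setup_trial(i):
    # Simulates one trial actor's setup() running on its own CPU.
    time.sleep(SETUP_SECONDS)
    return i

# All setups started at (nearly) the same time -> wall time ~= SETUP_SECONDS.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=NUM_TRIALS) as pool:
    list(pool.map(setup_trial, range(NUM_TRIALS)))
parallel = time.monotonic() - start

# Setups started one after another -> wall time ~= NUM_TRIALS * SETUP_SECONDS.
start = time.monotonic()
for i in range(NUM_TRIALS):
    setup_trial(i)
sequential = time.monotonic() - start

print(f"parallel: ~{parallel:.2f}s, sequential: ~{sequential:.2f}s")
```

The sequential case is what the logs below appear to show: each new trial only moves to RUNNING after the previous trial's setup has finished.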

Thanks Kai for the quick response, but for me it is taking almost 60*100 seconds to get all trials into the RUNNING state. As you can see in the logs, each trial takes around a minute to start: it took 31 minutes to get 42 trials into the RUNNING state. My cluster has 5 worker pods with 20 CPUs each.

INFO:kite:2023-04-13 13:19:45,795: == Status ==
INFO:kite:2023-04-13 13:19:45,796: Current time: 2023-04-13 12:19:45 (running for 00:00:47.71)
INFO:kite:2023-04-13 13:19:45,796: Memory usage on this node: 6.2/62.0 GiB
INFO:kite:2023-04-13 13:19:45,796: PopulationBasedTraining: 0 checkpoints, 0 perturbs
INFO:kite:2023-04-13 13:19:45,796: Resources requested: 1.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
INFO:kite:2023-04-13 13:19:45,797: Result logdir: /var/kitetmp/checkpoints/kite-job-230413-131010-274165/output/RfqTrainable_2023-04-13_12-18-57
INFO:kite:2023-04-13 13:19:45,797: Number of trials: 96/96 (95 PENDING, 1 RUNNING)

INFO:kite:2023-04-13 13:52:30,451: Current time: 2023-04-13 12:52:29 (running for 00:33:31.86)
INFO:kite:2023-04-13 13:52:30,451: Memory usage on this node: 7.0/62.0 GiB
INFO:kite:2023-04-13 13:52:30,451: PopulationBasedTraining: 0 checkpoints, 0 perturbs
INFO:kite:2023-04-13 13:52:30,451: Resources requested: 42.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
INFO:kite:2023-04-13 13:52:30,451: Result logdir: /var/kitetmp/checkpoints/kite-job-230413-131010-274165/output/RfqTrainable_2023-04-13_12-18-57
INFO:kite:2023-04-13 13:52:30,452: Number of trials: 96/96 (54 PENDING, 42 RUNNING)

How can I find out where the delay is? Is it possible to enable debug logs or something else?
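One way to get more detail is to enable Python DEBUG logging for Ray's loggers before calling `tune.run`. A minimal sketch, assuming the standard `ray` / `ray.tune` logger names used by this Ray version:

```python
import logging

# Turn on DEBUG output for Ray and Ray Tune. These are ordinary Python
# logging loggers, so no Ray-specific API is needed.
for name in ("ray", "ray.tune"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```

With this in place, the trial executor's status transitions (like the `Changing status from PENDING to RUNNING` lines below) become visible and can be timestamped against each other.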

DEBUG logs do confirm that trials are being scheduled sequentially. Can you please help? What can I try?

2023-04-13 17:14:06,561 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00000: Changing status from PENDING to RUNNING.

== Status ==
Current time: 2023-04-13 17:14:52 (running for 00:00:50.18)
Memory usage on this node: 6.8/62.0 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
Result logdir: /var/kitetmp/checkpoints/kite-job-230413-180302-000945/output/RfqTrainable_2023-04-13_17-14-01
Number of trials: 96/96 (95 PENDING, 1 RUNNING)

2023-04-13 17:14:52,013 DEBUG resource_updater.py:46 -- Checking Ray cluster resources.
2023-04-13 17:14:52,014 DEBUG trial_runner.py:955 -- Got new trial to run: kite-opt-96cpus-96samples_95efb_00001
2023-04-13 17:14:52,017 DEBUG trial_runner.py:993 -- Trying to start trial: kite-opt-96cpus-96samples_95efb_00001
2023-04-13 17:14:52,017 DEBUG ray_trial_executor.py:279 -- Trial kite-opt-96cpus-96samples_95efb_00001: Status PENDING unchanged.
2023-04-13 17:14:52,031 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:14:52,031 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:14:52,032 DEBUG ray_trial_executor.py:421 -- Trial kite-opt-96cpus-96samples_95efb_00001: Setting up new remote runner.
2023-04-13 17:14:52,034 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00001: Changing status from PENDING to RUNNING.

== Status ==
Current time: 2023-04-13 17:15:39 (running for 00:01:38.04)
Memory usage on this node: 6.8/62.0 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 2.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
Result logdir: /var/kitetmp/checkpoints/kite-job-230413-180302-000945/output/RfqTrainable_2023-04-13_17-14-01
Number of trials: 96/96 (94 PENDING, 2 RUNNING)

(RfqTrainable pid=3693, ip=100.72.46.35) 2023-04-13 17:15:39,868 INFO trainable.py:172 -- Trainable.setup took 39.357 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2023-04-13 17:15:39,887 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:15:39,887 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:15:39,889 DEBUG ray_trial_executor.py:421 -- Trial kite-opt-96cpus-96samples_95efb_00002: Setting up new remote runner.
2023-04-13 17:15:39,891 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00002: Changing status from PENDING to RUNNING.

== Status ==
Current time: 2023-04-13 17:46:32 (running for 00:32:31.15)
Memory usage on this node: 7.0/62.0 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 42.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
Result logdir: /var/kitetmp/checkpoints/kite-job-230413-180302-000945/output/RfqTrainable_2023-04-13_17-14-01
Number of trials: 96/96 (54 PENDING, 42 RUNNING)

2023-04-13 17:46:32,984 DEBUG resource_updater.py:46 -- Checking Ray cluster resources.
2023-04-13 17:46:32,985 DEBUG trial_runner.py:955 -- Got new trial to run: kite-opt-96cpus-96samples_95efb_00042
2023-04-13 17:46:32,985 DEBUG trial_runner.py:993 -- Trying to start trial: kite-opt-96cpus-96samples_95efb_00042
2023-04-13 17:46:32,986 DEBUG ray_trial_executor.py:279 -- Trial kite-opt-96cpus-96samples_95efb_00042: Status PENDING unchanged.
(RfqTrainable pid=8760, ip=100.72.50.214) INFO:absl:
(RfqTrainable pid=8760, ip=100.72.50.214) AverageReturn = -10.193077087402344
(RfqTrainable pid=8760, ip=100.72.50.214) AverageEpisodeLength = 274.2300109863281
(RfqTrainable pid=8760, ip=100.72.50.214) 2023-04-13 17:46:32,978 INFO trainable.py:172 -- Trainable.setup took 39.665 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2023-04-13 17:46:32,994 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:46:32,994 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:46:32,995 DEBUG ray_trial_executor.py:421 -- Trial kite-opt-96cpus-96samples_95efb_00042: Setting up new remote runner.
2023-04-13 17:46:32,997 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00042: Changing status from PENDING to RUNNING.
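One mitigation suggested by Ray itself in the `Trainable.setup took 39.x seconds` log lines above is `reuse_actors=True`, which keeps trial actors alive between trials so the expensive setup runs fewer times. A configuration sketch, not a verified fix; `RfqTrainable` and `pbt_scheduler` stand for the trainable class and `PopulationBasedTraining` instance visible in the logs, and the exact `tune.run` signature depends on your Ray version:

```python
from ray import tune

analysis = tune.run(
    RfqTrainable,             # the trainable from the logs above
    num_samples=96,
    resources_per_trial={"cpu": 1},
    reuse_actors=True,        # the hint from the Trainable.setup log line
    scheduler=pbt_scheduler,  # your PopulationBasedTraining instance
)
```

This does not explain why new trials are started one by one, but it should reduce how often the ~60 s setup is paid at all.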