DEBUG logs do confirm that trials are being scheduled sequentially. Can you please help? What can I try?
2023-04-13 17:14:06,561 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00000: Changing status from PENDING to RUNNING.
== Status ==
Current time: 2023-04-13 17:14:52 (running for 00:00:50.18)
Memory usage on this node: 6.8/62.0 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
Result logdir: /var/kitetmp/checkpoints/kite-job-230413-180302-000945/output/RfqTrainable_2023-04-13_17-14-01
Number of trials: 96/96 (95 PENDING, 1 RUNNING)
2023-04-13 17:14:52,013 DEBUG resource_updater.py:46 -- Checking Ray cluster resources.
2023-04-13 17:14:52,014 DEBUG trial_runner.py:955 -- Got new trial to run: kite-opt-96cpus-96samples_95efb_00001
2023-04-13 17:14:52,017 DEBUG trial_runner.py:993 -- Trying to start trial: kite-opt-96cpus-96samples_95efb_00001
2023-04-13 17:14:52,017 DEBUG ray_trial_executor.py:279 -- Trial kite-opt-96cpus-96samples_95efb_00001: Status PENDING unchanged.
2023-04-13 17:14:52,031 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:14:52,031 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:14:52,032 DEBUG ray_trial_executor.py:421 -- Trial kite-opt-96cpus-96samples_95efb_00001: Setting up new remote runner.
2023-04-13 17:14:52,034 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00001: Changing status from PENDING to RUNNING.
== Status ==
Current time: 2023-04-13 17:15:39 (running for 00:01:38.04)
Memory usage on this node: 6.8/62.0 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 2.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
Result logdir: /var/kitetmp/checkpoints/kite-job-230413-180302-000945/output/RfqTrainable_2023-04-13_17-14-01
Number of trials: 96/96 (94 PENDING, 2 RUNNING)
(RfqTrainable pid=3693, ip=100.72.46.35) 2023-04-13 17:15:39,868 INFO trainable.py:172 -- Trainable.setup took 39.357 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2023-04-13 17:15:39,887 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:15:39,887 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:15:39,889 DEBUG ray_trial_executor.py:421 -- Trial kite-opt-96cpus-96samples_95efb_00002: Setting up new remote runner.
2023-04-13 17:15:39,891 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00002: Changing status from PENDING to RUNNING.
== Status ==
Current time: 2023-04-13 17:46:32 (running for 00:32:31.15)
Memory usage on this node: 7.0/62.0 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 42.0/100 CPUs, 0/0 GPUs, 0.0/145.69 GiB heap, 0.0/62.86 GiB objects
Result logdir: /var/kitetmp/checkpoints/kite-job-230413-180302-000945/output/RfqTrainable_2023-04-13_17-14-01
Number of trials: 96/96 (54 PENDING, 42 RUNNING)
2023-04-13 17:46:32,984 DEBUG resource_updater.py:46 -- Checking Ray cluster resources.
2023-04-13 17:46:32,985 DEBUG trial_runner.py:955 -- Got new trial to run: kite-opt-96cpus-96samples_95efb_00042
2023-04-13 17:46:32,985 DEBUG trial_runner.py:993 -- Trying to start trial: kite-opt-96cpus-96samples_95efb_00042
2023-04-13 17:46:32,986 DEBUG ray_trial_executor.py:279 -- Trial kite-opt-96cpus-96samples_95efb_00042: Status PENDING unchanged.
(RfqTrainable pid=8760, ip=100.72.50.214) INFO:absl:
(RfqTrainable pid=8760, ip=100.72.50.214) AverageReturn = -10.193077087402344
(RfqTrainable pid=8760, ip=100.72.50.214) AverageEpisodeLength = 274.2300109863281
(RfqTrainable pid=8760, ip=100.72.50.214) 2023-04-13 17:46:32,978 INFO trainable.py:172 -- Trainable.setup took 39.665 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2023-04-13 17:46:32,994 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:46:32,994 DEBUG gcs_utils.py:288 -- internal_kv_get b'TuneRegistry:global:trainable_class/RfqTrainable' None
2023-04-13 17:46:32,995 DEBUG ray_trial_executor.py:421 -- Trial kite-opt-96cpus-96samples_95efb_00042: Setting up new remote runner.
2023-04-13 17:46:32,997 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00042: Changing status from PENDING to RUNNING.
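
To put a number on the delay between trial starts, here is a minimal sketch that pulls the "Changing status from PENDING to RUNNING" timestamps out of the driver log and prints the gap between consecutive trials. It is not part of Ray; the names `START_RE`, `trial_start_times`, `start_gaps`, and `SAMPLE` are hypothetical, and the regular expression assumes the line layout seen in the excerpts above.

```python
import re
from datetime import datetime

# Assumed layout, taken from the log excerpts above:
# "<date> <time>,<millis> DEBUG ray_trial_executor.py:<line> <sep> Trial <name>: Changing status ..."
# <sep> is matched loosely so both "--" and a typographic dash are accepted.
START_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) DEBUG "
    r"ray_trial_executor\.py:\d+ \S+ "
    r"Trial (\S+): Changing status from PENDING to RUNNING\."
)

# Three start events copied from the log above.
SAMPLE = """\
2023-04-13 17:14:06,561 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00000: Changing status from PENDING to RUNNING.
2023-04-13 17:14:52,034 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00001: Changing status from PENDING to RUNNING.
2023-04-13 17:15:39,891 DEBUG ray_trial_executor.py:281 -- Trial kite-opt-96cpus-96samples_95efb_00002: Changing status from PENDING to RUNNING.
"""


def trial_start_times(log_text):
    """Return (trial_name, start_time) pairs in the order they appear."""
    starts = []
    for line in log_text.splitlines():
        match = START_RE.match(line.strip())
        if match is None:
            continue  # status tables, worker output, other DEBUG lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        stamp = stamp.replace(microsecond=int(match.group(2)) * 1000)
        starts.append((match.group(3), stamp))
    return starts


def start_gaps(starts):
    """Seconds elapsed between each pair of consecutive trial starts."""
    return [
        (later[1] - earlier[1]).total_seconds()
        for earlier, later in zip(starts, starts[1:])
    ]


if __name__ == "__main__":
    for gap in start_gaps(trial_start_times(SAMPLE)):
        print(f"{gap:.3f} s between consecutive trial starts")
```

If the gaps stay close to the `Trainable.setup took ...` duration reported by the workers, that would be consistent with each trial waiting for the previous trial's setup to finish before it is launched.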