Changing training_iterations from 20 to 50 increases my run time by more than 20x

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

I’m running on fairly small hardware, only 2x 3090s. I’m running PBT; the code is here: https://github.com/getorca/mamba_for_sequence_classification/blob/main/examples/train_finacial_phrasebank_ray_pbt.py

With 20 training_iterations the code completes in around 30 minutes. However, now that I’ve bumped it up to 50, it’s been running for over 12 hours (Current time: 2024-04-26 02:36:16. Total running time: 12hr 2min 45s). Scaling linearly, I’d expect roughly 75 minutes, not 12+ hours.
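For reference, the setup has roughly this shape (a simplified sketch, not the exact linked script: the training-loop body, the perturbation_interval value, and the mutation ranges are stand-ins, and I’m assuming training_iterations ends up as a training_iteration stop criterion):

```python
import os
import tempfile

from ray import train, tune
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import PopulationBasedTraining


def train_func(config):
    # Stand-in for the real HF/mamba training loop in the linked script.
    for step in range(50):
        # ... forward/backward/eval would happen here ...
        with tempfile.TemporaryDirectory() as tmp:
            # PBT needs a checkpoint with each report so it can exploit good trials.
            with open(os.path.join(tmp, "step.txt"), "w") as f:
                f.write(str(step))
            train.report(
                {"eval_loss": 1.0 / (step + 1), "eval_accuracy": 0.0},
                checkpoint=train.Checkpoint.from_directory(tmp),
            )


pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=4,  # placeholder; see the linked script for the real value
    hyperparam_mutations={
        "train_loop_config": {
            "learning_rate": tune.loguniform(1e-4, 1e-3),
            "weight_decay": tune.uniform(0.01, 0.05),
        }
    },
)

trainer = TorchTrainer(
    train_func,
    scaling_config=train.ScalingConfig(num_workers=1, use_gpu=True),  # 1 GPU per trial
)

tuner = tune.Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "learning_rate": tune.loguniform(1e-4, 1e-3),
            "weight_decay": tune.uniform(0.01, 0.05),
        }
    },
    tune_config=tune.TuneConfig(
        scheduler=pbt, metric="eval_loss", mode="min", num_samples=4
    ),
    run_config=train.RunConfig(
        stop={"training_iteration": 50},  # this was 20 when the run took ~30 min
    ),
)
tuner.fit()
```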

I’m not sure how to debug this. Looking at the iter column below, it looks like it’s run more than 50 iterations:

Trial status: 2 RUNNING | 1 PENDING | 1 PAUSED
Current time: 2024-04-26 02:36:47. Total running time: 12hr 3min 16s
Logical resource usage: 10.0/16 CPUs, 2.0/2 GPUs (0.0/1.0 accelerator_type:G)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                 status       ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)     eval_loss     eval_accuracy     eval_runtime     ...amples_per_second │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ TorchTrainer_c9b0a_00000   RUNNING               0.000214699                0.0216393       40            17362.3      0.537224          0.798578          16.0341                   26.319 │
│ TorchTrainer_c9b0a_00001   RUNNING               0.000428546                0.0259672       39            17016.5      0.448538          0.824645          13.102                    32.209 │
│ TorchTrainer_c9b0a_00003   PAUSED                0.000297601                0.0270491       39            17029.5      0.448538          0.824645          13.1308                   32.138 │
│ TorchTrainer_c9b0a_00002   PENDING               0.000357121                0.0216393       39            17016.5      0.448538          0.824645          13.102                    32.209 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Any help debugging this would be appreciated.


I’m very confused as to what’s happening; it seems like it should be 50 iterations per sample. However, sometimes with parameter changes it stops/terminates after just a few iterations, well before any stopping criterion is met.

Update: I can easily and consistently recreate the “early stopping” condition when num_samples=2 and perturbation_interval=2 :man_shrugging:
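For anyone trying to reproduce, here’s a pared-down sketch of that configuration with a dummy trainable standing in for the real TorchTrainer (the fake loss and the mutation ranges are made up; the relevant bits are num_samples=2, perturbation_interval=2, and the training_iteration stop):

```python
import json
import os
import tempfile

from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining


def dummy_train(config):
    # Resume from the last checkpoint if PBT restored this trial, else start at 1.
    start = 1
    ckpt = train.get_checkpoint()
    if ckpt:
        with ckpt.as_directory() as ckpt_dir:
            with open(os.path.join(ckpt_dir, "state.json")) as f:
                start = json.load(f)["step"] + 1

    for step in range(start, 51):
        fake_loss = 1.0 / (step * config["learning_rate"] * 1e4)
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "state.json"), "w") as f:
                json.dump({"step": step}, f)
            train.report(
                {"eval_loss": fake_loss},
                checkpoint=train.Checkpoint.from_directory(tmp),
            )


pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,  # the setting that triggers the early termination for me
    hyperparam_mutations={
        "learning_rate": tune.loguniform(1e-4, 1e-3),
        "weight_decay": tune.uniform(0.01, 0.05),
    },
)

tuner = tune.Tuner(
    dummy_train,
    param_space={
        "learning_rate": tune.loguniform(1e-4, 1e-3),
        "weight_decay": tune.uniform(0.01, 0.05),
    },
    tune_config=tune.TuneConfig(
        scheduler=pbt, metric="eval_loss", mode="min", num_samples=2
    ),
    run_config=train.RunConfig(stop={"training_iteration": 50}),
)
tuner.fit()
```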

Current version: ray 3.0.0.dev0

Also tested with ray-2.12.0 and the same thing happens.