How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty in completing my task, but I can work around it.
I’m running on fairly small hardware, only 2x3090s. I’m running PBT (Population Based Training); the code is here: https://github.com/getorca/mamba_for_sequence_classification/blob/main/examples/train_finacial_phrasebank_ray_pbt.py
With 20 training_iterations the code completes in around 30 minutes. However, now that I’ve bumped it up to 50, it has been running for over 12 hours (current time: 2024-04-26 02:36:16; total running time: 12hr 2min 45s).
I’m not sure how to debug this. Looking at the iter column below, it looks like it has run more than 50 iterations:
Trial status: 2 RUNNING | 1 PENDING | 1 PAUSED
Current time: 2024-04-26 02:36:47. Total running time: 12hr 3min 16s
Logical resource usage: 10.0/16 CPUs, 2.0/2 GPUs (0.0/1.0 accelerator_type:G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                status    ...fig/learning_rate  ...nfig/weight_decay  iter  total time (s)  eval_loss  eval_accuracy  eval_runtime  ...amples_per_second │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ TorchTrainer_c9b0a_00000  RUNNING            0.000214699             0.0216393    40         17362.3   0.537224       0.798578       16.0341                26.319 │
│ TorchTrainer_c9b0a_00001  RUNNING            0.000428546             0.0259672    39         17016.5   0.448538       0.824645        13.102                32.209 │
│ TorchTrainer_c9b0a_00003  PAUSED             0.000297601             0.0270491    39         17029.5   0.448538       0.824645       13.1308                32.138 │
│ TorchTrainer_c9b0a_00002  PENDING            0.000357121             0.0216393    39         17016.5   0.448538       0.824645        13.102                32.209 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
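As a rough sanity check, the figures above can be put side by side. The sketch below is plain Python arithmetic on numbers copied from this post; it does not call Ray. It rests on two assumptions that may not hold: that wall time would scale linearly from 20 to 50 iterations, and that `total time (s)` divided by `iter` approximates the per-iteration cost (under PBT, a trial that clones another trial's checkpoint may inherit its counters, which could explain why trials 00001 and 00002 report identical values).

```python
# All figures are copied from the post; variable names are illustrative only.
baseline_iters = 20
baseline_wall_min = 30                       # "around 30 minutes" for the 20-iteration run
target_iters = 50
observed_wall_min = 12 * 60 + 3 + 16 / 60    # 12hr 3min 16s elapsed, run still not finished

# Assumption: wall time scales linearly with the number of iterations.
expected_wall_min = baseline_wall_min * target_iters / baseline_iters
slowdown = observed_wall_min / expected_wall_min

# (trial name, iter, total time in seconds) taken from the status table.
rows = [
    ("TorchTrainer_c9b0a_00000", 40, 17362.3),
    ("TorchTrainer_c9b0a_00001", 39, 17016.5),
    ("TorchTrainer_c9b0a_00003", 39, 17029.5),
    ("TorchTrainer_c9b0a_00002", 39, 17016.5),
]

# Assumption: total time / iter approximates the cost of one iteration.
sec_per_iter = {name: total_s / iters for name, iters, total_s in rows}

# Upper bound on per-iteration cost in the 20-iteration run: even if a single
# trial had been running for the entire wall time, it could not exceed this.
baseline_sec_per_iter_max = baseline_wall_min * 60 / baseline_iters

print(f"linear estimate for {target_iters} iterations: {expected_wall_min:.0f} min")
print(f"elapsed so far: {observed_wall_min:.0f} min ({slowdown:.1f}x the estimate)")
print(f"20-iteration run, at most {baseline_sec_per_iter_max:.0f} s/iter")
for name, value in sec_per_iter.items():
    print(f"{name}: {value:.1f} s/iter")
```

If those assumptions are roughly right, each reported iteration now costs several times more than it could have in the 20-iteration run, so the slowdown is more than the extra iterations alone would account for.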
Any help debugging this would be appreciated.