Can I see the config.yaml you used here? I am trying to run the exact same code with a similar setup, but something goes wrong and I can’t even get the training to start. See more here: Cuda Error: invalid device ordinal during training on GCP cluster