Ray fails to finish a parallel run when one node is blocked

  • Medium: It makes completing my task significantly harder, but I can work around it (by not using Ray).

If any of our nodes stops running workers, the entire run is blocked from finishing. I am trying to figure out how to fix this. (More detail and reproduction steps are in issue #28071, "[Core] ray fails to finish a parallel run when one node is blocked", in ray-project/ray on GitHub.) It is odd, because most of the ray.remote calls complete, but the last few never finish. This has occurred in past versions of Ray (starting with 1.9) and still occurs in 2.0.
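For anyone trying to narrow down the symptom described above, here is a minimal sketch of one way to notice when the last few tasks stop making progress instead of blocking forever in ray.get. It is not a fix for the underlying issue and is not taken from #28071. The names drain_with_stall_detection, run_on_ray, work, and the 60-second timeout are hypothetical, chosen only for illustration; the only Ray calls assumed are ray.init, ray.remote, ray.wait (with num_returns and timeout), and ray.get.

```python
def drain_with_stall_detection(pending, wait, stall_timeout_s=60.0):
    """Collect finished refs one at a time; stop if nothing finishes in time.

    `wait` is expected to behave like ray.wait: it returns (ready, remaining).
    Returns (done_refs, stuck_refs). Hypothetical helper, not a Ray API.
    """
    done = []
    pending = list(pending)
    while pending:
        ready, pending = wait(pending, num_returns=1, timeout=stall_timeout_s)
        if not ready:
            # No task finished within the timeout: treat the rest as stalled.
            return done, list(pending)
        done.extend(ready)
    return done, []


def run_on_ray():
    """Example driver; assumes Ray is installed and a cluster is reachable."""
    import ray

    ray.init()

    @ray.remote
    def work(i):  # hypothetical stand-in for the real task
        return i * i

    refs = [work.remote(i) for i in range(100)]
    done, stuck = drain_with_stall_detection(refs, ray.wait, stall_timeout_s=60.0)
    results = ray.get(done)  # these refs are already complete
    print(f"{len(results)} finished, {len(stuck)} still pending")
    return results, stuck
```

Calling run_on_ray() from the driver would then report how many tasks are still pending rather than hanging. What to do with the stuck refs (resubmit, cancel, or just log them) depends on the workload and is left out of the sketch.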

I have followed up on the GitHub issues that you posted. Thanks!

Thank you. Yes, I agree that issue #27499, "[Ray component: Core] Add Way to limit number of ray workers", would probably prevent us from hitting #28071, but I still think #28071 is a bug that needs to be fixed as well. (I also replied at #28071.)