Hello,
I think I understand that Ray re-executes failed tasks and actors based on lineage.
If users train a deep learning model in a model-parallel fashion and a training actor dies, does the entire training job fail, or only the epoch during which the actor died?
Thanks