Model-parallel training with a failed training actor


I think I understand that Ray re-executes failed tasks and actors based on lineage.

If users train deep learning models in a model-parallel fashion and one of the training actors dies, will the entire training job fail, or only the current epoch in which that actor died?


cc @kai, can you answer this question? I think it is about the general fault tolerance of actors.