Model Parallel training with failed training actor

asm582 · August 5, 2021, 10:38pm

Hello,

As I understand it, Ray re-executes failed tasks and actors based on lineage.

If users train deep learning models in a model-parallel fashion and a training actor dies, will the entire training job fail, or only the epoch in which the actor died?

Thanks

sangcho · August 9, 2021, 10:18pm

cc @kai, can you answer this question? I think it is about general fault tolerance of actors.
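
For context on the question, here is a minimal plain-Python sketch of the general checkpoint-and-resume pattern that decides how much work is lost when a training worker dies. It does not use Ray's API and makes no claim about what Ray does by default. The names `WorkerDied`, `train_one_epoch`, `run_training`, `fail_at`, and `max_restarts` are hypothetical and exist only for illustration. The assumption is that state is checkpointed after every epoch, so a restarted worker redoes only the failed epoch.

```python
import copy


class WorkerDied(Exception):
    """Stands in for a training actor crashing mid-epoch (hypothetical)."""


def train_one_epoch(state, epoch, fail_at):
    # Hypothetical training step: "weights" is just a counter here.
    if epoch in fail_at:
        fail_at.remove(epoch)  # fail only once for each listed epoch
        raise WorkerDied(f"worker died during epoch {epoch}")
    state["weights"] += 1
    state["epoch"] = epoch + 1
    return state


def run_training(num_epochs, fail_at, max_restarts=3):
    checkpoint = {"weights": 0, "epoch": 0}  # last durable state
    restarts = 0
    epochs_run = []  # every attempt, including retried epochs
    while checkpoint["epoch"] < num_epochs:
        # A restarted worker begins from the checkpoint, not from memory.
        state = copy.deepcopy(checkpoint)
        epoch = state["epoch"]
        epochs_run.append(epoch)
        try:
            state = train_one_epoch(state, epoch, fail_at)
        except WorkerDied:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up: the whole job fails
            continue  # redo only the failed epoch
        checkpoint = copy.deepcopy(state)  # persist after each epoch
    return checkpoint, restarts, epochs_run


final_state, restarts, epochs_run = run_training(num_epochs=5, fail_at={2})
print(final_state, restarts, epochs_run)
```

In this sketch, without the per-epoch checkpoint the restarted worker would begin again from epoch 0, and once `max_restarts` is exceeded the exception propagates and the whole job fails.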
