[RaySGD] Training instability

Hi all! We are using RaySGD + PyTorch + GCP to train our neural networks faster on more GPUs. Scaling is amazing and we are very satisfied with the training speed. But sometimes the cluster fails for unknown reasons, usually after about 5 hours of training. The most common error is: The actor died unexpectedly before finishing this task. Could you help me understand the reason for this error, or explain how I can debug it?


@rliaw @kai @amogkam might have some suggestions.

Hi @Vanster, do you have a longer stack trace you can share? Also, is this running on preemptible/spot instances where the nodes can die during the training process?

I will try to reproduce the issue to provide a longer stack trace if needed. Yes, the cluster is running on preemptible instances.

@amogkam I generally see two types of errors. The most frequent one:

E0313 21:46:49.631943 12120 12167 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.util.sgd.torch.distributed_torch_runner, class_name=DistributedTorchRunner, function_name=setup_process_group, function_hash=}, task_id=64086456c32b67a1b32e035b01000000, task_name=DistributedTorchRunner.setup_process_group(), job_id=01000000, num_args=8, num_returns=2, actor_task_spec={actor_id=b32e035b01000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}
E0313 21:46:49.632853 12120 12167 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.util.sgd.torch.distributed_torch_runner, class_name=DistributedTorchRunner, function_name=setup_process_group, function_hash=}, task_id=ad0f52840d014335bcc2847e01000000, task_name=DistributedTorchRunner.setup_process_group(), job_id=01000000, num_args=8, num_returns=2, actor_task_spec={actor_id=bcc2847e01000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}
2021-03-13 21:46:49,633 WARNING worker.py:1091 -- The node with node id 2a4aa3a932095d151585995c94be9f81b440d7be has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
Traceback (most recent call last):
  File "RayPTTrainer.py", line 676, in <module>
    main(_args)
  File "RayPTTrainer.py", line 640, in main
    trainer.train(num_steps=train_config.snapshot, info=train_info)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 420, in train
    self._resize_worker_group()
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 343, in _resize_worker_group
    self._start_workers(int(new_workers))
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 324, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    address=address, world_size=num_workers))
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1454, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Sometimes it is this one:

Traceback (most recent call last):
  File "RayPTTrainer.py", line 676, in <module>
    main(_args)
  File "RayPTTrainer.py", line 640, in main
    trainer.train(num_steps=train_config.snapshot, info=train_info)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 427, in train
    dataset=dataset)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 325, in train
    success = check_for_failure(remote_worker_stats)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/utils.py", line 244, in check_for_failure
    finished = ray.get(finished)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::DistributedTorchRunner.train_epoch() (pid=13191, ip=10.140.0.40)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 113, in train_epoch
    num_steps=num_steps, profile=profile, info=info, iterator=iterator)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/torch_runner.py", line 143, in train_epoch
    train_stats = self.training_operator.train_epoch(iterator, info)
  File "RayPTTrainer.py", line 247, in train_epoch
    return self.epoch(iterator, info, test=False)
  File "RayPTTrainer.py", line 434, in epoch
    sum_loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: NCCL error: unhandled system error, NCCL version 2.7.8
E0312 23:05:12.899838 11123 11167 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.util.sgd.torch.distributed_torch_runner, class_name=DistributedTorchRunner, function_name=train_epoch, function_hash=}, task_id=9fa9202688f4f44adfbd71fd01000000, task_name=DistributedTorchRunner.train_epoch(), job_id=01000000, num_args=6, num_returns=2, actor_task_spec={actor_id=dfbd71fd01000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=5}
2021-03-12 23:05:13,182 DEBUG worker.py:1035 -- Suppressing error from worker: ray::DistributedTorchRunner.train_epoch() (pid=13191, ip=10.140.0.40)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 113, in train_epoch
    num_steps=num_steps, profile=profile, info=info, iterator=iterator)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/sgd/torch/torch_runner.py", line 143, in train_epoch
    train_stats = self.training_operator.train_epoch(iterator, info)
  File "RayPTTrainer.py", line 247, in train_epoch
    return self.epoch(iterator, info, test=False)
  File "RayPTTrainer.py", line 434, in epoch
    sum_loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: NCCL error: unhandled system error, NCCL version 2.7.8

Hey @Vanster,

So this is probably what’s happening with the first issue:

  1. Node goes down in the middle of training
  2. This triggers a retry of the current training epoch. Internally, RaySGD shuts down all the workers, recreates them with the newly available resources, and then re-trains the current epoch.
  3. During this worker startup process, another node goes down, and we don't currently handle fault tolerance for this case.

I made a PR ([SGD] Worker Startup Fault Tolerance by amogkam · Pull Request #14724 · ray-project/ray · GitHub) to add functionality for handling failures during worker startup. Once it gets merged, you can try it out on the latest Ray wheels.
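In the meantime, a common mitigation on preemptible clusters is to checkpoint often and resume from the last snapshot if the whole job dies. Here is a minimal sketch of that pattern, assuming the ray.util.sgd.torch TorchTrainer API, with toy model/optimizer/data creators standing in for whatever your operator in RayPTTrainer.py already does:

    import os
    import ray
    import torch
    import torch.nn as nn
    from ray.util.sgd.torch import TorchTrainer, TrainingOperator

    ray.init(address="auto")

    # Toy creators just to keep the sketch self-contained; replace them with
    # your own model, optimizer, and data loaders.
    def model_creator(config):
        return nn.Linear(10, 1)

    def optimizer_creator(model, config):
        return torch.optim.SGD(model.parameters(), lr=1e-3)

    def data_creator(config):
        dataset = torch.utils.data.TensorDataset(
            torch.randn(1024, 10), torch.randn(1024, 1))
        return torch.utils.data.DataLoader(dataset, batch_size=64)

    MyOperator = TrainingOperator.from_creators(
        model_creator, optimizer_creator, data_creator, loss_creator=nn.MSELoss)

    trainer = TorchTrainer(
        training_operator_cls=MyOperator,
        num_workers=8,
        use_gpu=True,
        backend="nccl")

    CHECKPOINT = "/tmp/sgd_checkpoint.pt"  # any path reachable from the driver
    if os.path.exists(CHECKPOINT):
        trainer.load(CHECKPOINT)  # resume from the last snapshot after a crash

    for epoch in range(100):
        # TorchTrainer already retries an epoch internally when workers die;
        # saving after every epoch bounds how much work a hard failure can cost.
        trainer.train()
        trainer.save(CHECKPOINT)

    trainer.shutdown()

This doesn't prevent the startup failure itself, but it keeps a mid-startup crash from costing the whole run.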

As for the second issue, this is more of a PyTorch/NCCL issue. You can set the NCCL_DEBUG=INFO environment variable in your code to get more detailed NCCL output. You can follow the example here to set the environment variable for all workers: Distributed PyTorch — Ray v2.0.0.dev0
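Roughly, the pattern from that docs page is an initialization hook that runs on every worker before training starts (the initialization_hook argument of TorchTrainer). A minimal sketch:

    import os

    def initialization_hook():
        # Runs on every worker process before the NCCL process group is set up,
        # so the variables are in place when NCCL initializes.
        os.environ["NCCL_DEBUG"] = "INFO"
        # If NCCL is picking the wrong network interface, pinning it can also help:
        # os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"

    # Then pass it when constructing the trainer, e.g.
    # trainer = TorchTrainer(..., initialization_hook=initialization_hook)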

Great, thank you very much!