Any suggestions on how to debug the distributed torch trainer

Hi, I have developed a distributed NLP framework using RaySGD. It worked well, but after several recent commits the code started to crash with the following error:

Traceback (most recent call last):
  File "", line 760, in <module>
  File "/opt/tiger/runner/runner_lite/runner_lite/op/", line 292, in __call__
    res =*args, **kwargs)
  File "", line 626, in run
    'eval_batch_size': self.getc('evaluate.batch_size'),
  File "", line 463, in run_train_purely
    model_train_result = ModelTrainerRunner()
  File "/opt/tiger/runner/runner_lite/runner_lite/op/", line 292, in __call__
    res =*args, **kwargs)
  File "/opt/tiger/runner/runner_lite/runner_lite/op/", line 209, in run
    model_train_result = ModelTrainerRunner()
  File "/opt/tiger/runner/runner_lite/runner_lite/op/", line 292, in __call__
    res =*args, **kwargs)
  File "/opt/tiger/runner/runner_lite/runner_lite/op/ptx_v1/", line 60, in run
    return fit(option)
  File "/opt/tiger/runner/rtc/rtc/", line 152, in fit
    metric =
  File "/opt/tiger/runner/ptx/ptx/", line 1992, in fit
    train_metrics = trainer.train() # this returns
  File "/data00/jialin.liu/.local/lib/python3.7/site-packages/ray/util/sgd/torch/", line 415, in train
    num_steps=num_steps, profile=profile, info=info, dataset=dataset)
  File "/data00/jialin.liu/.local/lib/python3.7/site-packages/ray/util/sgd/torch/", line 325, in train
    success = check_for_failure(remote_worker_stats)
  File "/data00/jialin.liu/.local/lib/python3.7/site-packages/ray/util/sgd/", line 244, in check_for_failure
    finished = ray.get(finished)
  File "/data00/jialin.liu/.local/lib/python3.7/site-packages/ray/_private/", line 47, in wrapper
    return func(*args, **kwargs)
  File "/data00/jialin.liu/.local/lib/python3.7/site-packages/ray/", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::DistributedTorchRunner.train_epoch() (pid=303, ip=
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/data00/jialin.liu/.local/lib/python3.7/site-packages/ray/util/sgd/torch/", line 112, in train_epoch
    num_steps=num_steps, profile=profile, info=info, iterator=iterator)
  File "/data00/jialin.liu/.local/lib/python3.7/site-packages/ray/util/sgd/torch/", line 140, in train_epoch
    train_stats = self.training_operator.train_epoch(iterator, info)
  File "/opt/tiger/runner/ptx/ptx/", line 2430, in train_epoch
    metrics = self.train_batch(batch, batch_info=batch_info)
  File "/opt/tiger/runner/ptx/ptx/", line 2321, in train_batch
  File "/usr/local/lib/python3.7/dist-packages/torch/", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Operation timed out!

Any suggestions on how to debug this error?

@valiantljk thanks for reporting this! Is this error happening consistently? How many workers are you using? Can you check whether you get the same error with 1 worker? Also, are you using the gloo backend or nccl?

Thanks Amog,
Yes, I'm now seeing it consistently. I'm using 8 GPUs (8 trainers) and nccl (I believe it's the default in RaySGD?).
I'll try with one worker.

Just tested with two trainers, no issues.
I’m wondering if it’s due to another concurrent application.
When I tested with 8 GPU, there was another job running on GPU 1.

Just reran the 8-trainer case in a clean environment. It crashed with the same error.

Any general suggestions on how to debug?

Since you are using NCCL, you can set the NCCL_DEBUG environment variable to get some more useful output.

Can you set this in your code:

import os

def initialization_hook():
    # Runs on every Ray worker before training starts, so the
    # environment variable is set in each worker process.
    print("NCCL DEBUG SET")
    os.environ["NCCL_DEBUG"] = "INFO"

trainer = TorchTrainer(..., initialization_hook=initialization_hook)
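If INFO output alone doesn't pinpoint the hang, NCCL also reads a couple of other standard debug variables. A minimal sketch of a more verbose hook (the `verbose_nccl_hook` name is hypothetical; `NCCL_DEBUG_SUBSYS` and `NCCL_DEBUG_FILE` are standard NCCL settings, though whether they help depends on the NCCL build in use):

```python
import os

def verbose_nccl_hook():
    # Same role as initialization_hook above: pass it to TorchTrainer
    # so it runs on every worker.
    os.environ["NCCL_DEBUG"] = "INFO"
    # Restrict logging to initialization and collective calls
    # (use "ALL" for everything).
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"
    # Write each worker's NCCL log to its own file
    # (%h expands to hostname, %p to pid).
    os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl.%h.%p.log"
```

Per-worker log files make it easier to see which rank stops making progress before the timeout fires.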

Cool, I’ll try it! Thanks.