Horovod Trainer hangs

  • High: It blocks me from completing my task.

I'm using the Horovod trainer. The training job hangs after many iterations without failing; the Horovod timeout is left at whatever Ray's default is. I expect the job to fail with an exception thrown, but instead the job status is still "running" while the GPU is not being utilized and the training logs don't update.

Can someone suggest any debugging tips?

Hey, one suggestion is to use the Ray Dashboard to find the stack trace of the training worker processes, to identify where the job is hanging.
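
If the dashboard isn't convenient, another option (just a sketch using the Python standard library, nothing Ray-specific; train_fn stands in for your training function and the 10-minute interval is an arbitrary choice) is to have each worker periodically dump its own thread stacks to its logs with faulthandler:

import faulthandler
import sys

def train_fn(config):
    # Dump every thread's stack trace to stderr every 10 minutes so the worker
    # logs show where execution is stuck, even if the job never fails.
    faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)
    try:
        ...  # existing training code
    finally:
        faulthandler.cancel_dump_traceback_later()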

Thanks Matthew for taking a look. The stack trace is as follows:

Process 30141: ray::RayTrainWorker._RayTrainWorker__execute
Python v3.10.2 (/export/apps/python/3.10.2/bin/python3.10)

Thread 30141 (idle): "MainThread"
    wait (threading.py:324)
    get (queue.py:180)
    get_next (ray/train/_internal/session.py:175)
    get_next (ray/train/_internal/backend_executor.py:425)
    __execute (ray/train/_internal/worker_group.py:28)
    _resume_span (ray/util/tracing/tracing_helper.py:466)
    actor_method_executor (ray/_private/function_manager.py:674)
    main_loop (ray/_private/worker.py:763)
    <module> (ray/_private/workers/default_worker.py:233)
Thread 30233 (idle): "ray_import_thread"
    wait (threading.py:324)
    _wait_once (grpc/_common.py:112)
    wait (grpc/_common.py:157)
    result (grpc/_channel.py:733)
    _poll_locked (ray/_private/gcs_pubsub.py:255)
    poll (ray/_private/gcs_pubsub.py:391)
    _run (ray/_private/import_thread.py:69)
    run (threading.py:946)
    _bootstrap_inner (threading.py:1009)
    _bootstrap (threading.py:966)
Thread 30349 (idle): "Thread-11 (trainable_fn)"
    quick_execute (tensorflow/python/eager/execute.py:52)
    call (tensorflow/python/eager/polymorphic_function/monomorphic_function.py:378)
    _call_flat (tensorflow/python/eager/polymorphic_function/monomorphic_function.py:1745)
    __call__ (tensorflow/python/eager/polymorphic_function/tracing_compiler.py:134)
    _call (tensorflow/python/eager/polymorphic_function/polymorphic_function.py:912)
    __call__ (tensorflow/python/eager/polymorphic_function/polymorphic_function.py:880)
    error_handler (tensorflow/python/util/traceback_utils.py:150)
    fit (keras/engine/training.py:1650)
    error_handler (keras/utils/traceback_utils.py:65)
    train (trainer.py:270)
    main (libenchmarkingpipeline/run_finetune_and_eval_pipeline.py:171)
    trainable_fn (utils/convert_to_ray_trainable.py:65)
    discard_return_wrapper (ray/train/_internal/utils.py:129)
    train_fn (ray/train/_internal/utils.py:149)
    run (ray/air/_internal/util.py:88)
    _bootstrap_inner (threading.py:1009)
    _bootstrap (threading.py:966)
Thread 32010 (idle): "Thread-61 (_run)"
    channel_spin (grpc/_channel.py:1258)
    run (contextvars/__init__.py:38)
    run (threading.py:946)
    _bootstrap_inner (threading.py:1009)
    _bootstrap (threading.py:966)

I am not sure how to interpret the stack trace here. Any insight?

I have found the root cause: the training data was not sharded properly, so training finished earlier on some workers than on others. I had assumed Horovod would eventually time out in that case.
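
To illustrate what I mean by sharding (a minimal sketch, not our actual pipeline; the file pattern and batch size are placeholders), the usual pattern is to give each Horovod rank its own equally sized slice of the tf.data input so that every worker runs the same number of allreduce steps:

import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Each rank reads a distinct 1/hvd.size() slice of the records, and
# drop_remainder helps keep the per-worker step counts aligned.
dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("data/train-*.tfrecord"))
    .shard(num_shards=hvd.size(), index=hvd.rank())
    .batch(32, drop_remainder=True)
)

That would also explain the stack trace above: the thread stuck in quick_execute is most likely blocked inside an allreduce, waiting on a rank that has already run out of data.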

I expect Ray to fail the job instead of letting it hang there for hours. Do I need to set anything to get that behavior?

Oh nice, glad you were able to find the root cause!

Hmm, do you know if Horovod (or TensorFlow) would actually time out in this scenario? My understanding is that the hang can still occur while the remaining workers wait to synchronize gradients.
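
One workaround I can think of (a rough, untested sketch, not an official Ray or Horovod feature) is a per-worker watchdog callback that aborts the process when no batch has completed within some deadline, so the hang surfaces as a worker failure instead of a silently "running" job:

import os
import threading
import time

import tensorflow as tf

class HangWatchdog(tf.keras.callbacks.Callback):
    """Kill the worker process if no batch finishes within deadline_s seconds."""

    def __init__(self, deadline_s=1800):
        super().__init__()
        self.deadline_s = deadline_s
        self._last_progress = time.monotonic()
        threading.Thread(target=self._watch, daemon=True).start()

    def on_train_batch_end(self, batch, logs=None):
        # Record progress after every training batch.
        self._last_progress = time.monotonic()

    def _watch(self):
        while True:
            time.sleep(60)
            if time.monotonic() - self._last_progress > self.deadline_s:
                # Exiting hard makes the hang visible as a worker failure.
                os._exit(1)

You'd pass it to model.fit(..., callbacks=[HangWatchdog()]). Horovod also has stall-check environment variables (HOROVOD_STALL_CHECK_TIME_SECONDS and HOROVOD_STALL_SHUTDOWN_TIME_SECONDS, if I'm remembering the names right) that warn about, and can optionally shut down on, stalled ranks, but I haven't verified how that interacts with Ray Train.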

How did you determine (or how could one check) that the data was not sharded properly? What did you do to fix it? I may have the same issue.