How to quickly check how many times a TorchTrainer job has restarted?

psnilesh · December 29, 2025, 9:24pm

Following the guide, Handling Failures and Node Preemption — Ray 2.53.0

I have set up worker-level fault tolerance like so:

import ray.train
import ray.train.torch

trainer = ray.train.torch.TorchTrainer(
    train_fn_per_worker,
    scaling_config=ray.train.ScalingConfig(num_workers=4),
    run_config=ray.train.RunConfig(
        # [3] Enable worker-level fault tolerance to gracefully handle
        # Train worker failures.
        failure_config=ray.train.FailureConfig(max_failures=3),
    ),
)

I was wondering whether there is a convenient way to look up how many worker faults have happened in a running Ray job using the CLI or the dashboard. I can see the fault count in the Ray job logs, but it would help to know if there’s an easier way to check this without grepping through the entire driver logs.

RunLLM · December 29, 2025, 9:25pm

There is no direct CLI command or dashboard field that shows the current worker fault count for a running Ray Train job. The Ray Dashboard provides job and worker status, logs, and error messages, but it does not expose a real-time counter for worker restarts or faults, so you would still need to check the logs for detailed fault and retry information. The dashboard’s Jobs view and the Train dashboard can help you identify failed attempts and worker restarts, though not as a simple counter: each retry is shown as a new attempt, and the history of attempts and their status details indirectly reflects the number of faults encountered during the run (Ray Train Fault Tolerance Guide, Train Dashboard Monitoring).
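If checking the logs is the fallback, the grep step can at least be scripted. The following is a minimal sketch, not a Ray feature: `FAILURE_PATTERN` is a placeholder I made up and does not reflect Ray's actual log wording, and the helper names are hypothetical. Swap in whatever text your own driver log prints when a worker failure is reported.

```python
import re
from typing import Iterable

# ASSUMPTION: placeholder pattern, NOT Ray's real log format. Replace it with
# the message you actually see in your driver log on a worker failure.
FAILURE_PATTERN = re.compile(r"worker failure", re.IGNORECASE)


def count_matching_lines(lines: Iterable[str], pattern=FAILURE_PATTERN) -> int:
    """Count the lines that contain at least one match for `pattern`."""
    return sum(1 for line in lines if pattern.search(line))


def count_in_file(path: str, pattern=FAILURE_PATTERN) -> int:
    """Count matching lines in a log file, streaming it line by line."""
    with open(path, "r", encoding="utf-8", errors="replace") as log_file:
        return count_matching_lines(log_file, pattern)
```

Calling `count_in_file` on a saved copy of the driver log gives a single number instead of a page of grep output. Note that it counts matching lines, so if one failure produces several matching lines the pattern needs to be narrowed to avoid overcounting.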

For a more granular view, you can inspect the “history of Train run attempts” in the Train dashboard, where each attempt corresponds to a retry after a worker failure. This lets you count the number of retries (faults) without grepping logs, but there is no single “fault count” field exposed via CLI or dashboard at this time (Train Dashboard Monitoring). Would you like more detail on how to use the dashboard to view this information?


psnilesh · December 29, 2025, 10:26pm

@RunLLM Do you have any ideas on how this can be supported? I would love to have this available in the ray dashboard.

RunLLM · December 29, 2025, 10:26pm

Currently, the Ray Dashboard does not provide a built-in field or metric for displaying the number of worker faults or restarts in real time for a running job. To support this, Ray would need to expose a new metric (e.g., “worker_fault_count”) at the job or run level, and the dashboard would need to surface this metric in the job or Train run details view. This could be implemented by tracking the number of failed worker attempts and displaying it alongside other run metadata (Train Dashboard Monitoring).
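To make the idea concrete, here is a rough sketch of the bookkeeping such a counter would need. It is plain Python and purely illustrative: `FaultCounter`, `worker_fault_count`, and the method names are hypothetical and not part of Ray. The only thing borrowed from Ray is the meaning of `max_failures` as I understand it from `FailureConfig` (a value of -1 means retry indefinitely, 0 means no retries).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultCounter:
    """Hypothetical run-level fault counter; not an actual Ray class."""

    max_failures: int  # assumed to mirror FailureConfig(max_failures=...)
    worker_fault_count: int = 0

    def record_failure(self) -> None:
        """Call once for every failed attempt that triggers a retry."""
        self.worker_fault_count += 1

    def remaining_retries(self) -> Optional[int]:
        """Retries left in the budget, or None when retries are unlimited."""
        if self.max_failures < 0:
            return None
        return max(self.max_failures - self.worker_fault_count, 0)

    def summary(self) -> str:
        """One-line text a dashboard could show next to other run metadata."""
        remaining = self.remaining_retries()
        budget = "unlimited" if remaining is None else str(remaining)
        return f"worker faults: {self.worker_fault_count}, retries left: {budget}"
```

With `max_failures=3` and two recorded failures, `summary()` reports two faults and one retry left, which is the kind of at-a-glance value the question asks for.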

You can suggest this feature by opening a GitHub issue or feature request in the Ray repository, describing the need for a visible worker fault/retry count in the dashboard. This aligns with ongoing efforts to improve observability and expose more internal metrics in Ray (Ray Observability RFC). Would you like more detail on how to propose this or on possible implementation approaches?

