Raylet retries forever when the submitted job fails at runtime_env creation

Hi team,

I encountered an issue where a job submitted to the Ray cluster failed during runtime_env installation. The Raylet continuously retries the installation and ends up stuck indefinitely. As a result, the job remains pending and cannot be stopped.

I’m not sure if this is the intended behavior (which would be quite unexpected) or a problem specific to my setup. Is there a way to configure a maximum number of retry attempts?

Any insights or suggestions would be appreciated.

By default, Ray retries a failed runtime environment (runtime_env) installation up to 3 times for tasks. For job-level runtime_env setup, a failed installation should transition the job to FAILED rather than leave it stuck in PENDING. However, there have been reports of jobs getting stuck in PENDING when the runtime_env setup hangs or fails in certain ways, which also makes the job unresponsive to stop commands. This is not intended behavior and has been addressed in recent Ray releases (see this issue and fix).
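For context, here is a minimal sketch of where that setup step happens in the job submission flow, using the Jobs API. The dashboard address, entrypoint script, and pip dependency are placeholders, not taken from your setup:

```python
from ray.job_submission import JobSubmissionClient

# Address is a placeholder; point it at your cluster's dashboard/Jobs API.
client = JobSubmissionClient("http://127.0.0.1:8265")

# The runtime_env is installed on the cluster before the entrypoint runs;
# that installation step is where the failure described above occurs.
job_id = client.submit_job(
    entrypoint="python my_script.py",
    runtime_env={"pip": ["requests==2.31.0"]},
)

# With the fixes mentioned above, a failed setup should surface as FAILED
# rather than leaving the job in PENDING.
print(client.get_job_status(job_id))
```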

To control retries for tasks, use the max_retries argument in @ray.remote or set the RAY_TASK_MAX_RETRIES environment variable. For jobs, the max_retries field in the job config controls how many times the job is retried after a failure, but this does not directly affect runtime_env installation retries. If you are seeing jobs stuck in PENDING due to runtime_env failures, upgrading to Ray 2.49 or later is recommended, as this includes fixes to ensure jobs fail properly after a timeout or repeated setup failures (details, docs).
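As a sketch of the task-level knobs mentioned above (the function body and the pinned package are hypothetical examples, not from your job):

```python
import ray

ray.init()

# max_retries controls how many times Ray re-runs the *task* after a failure;
# per the note above it can also be set via the RAY_TASK_MAX_RETRIES
# environment variable, and it does not change runtime_env installation retries.
@ray.remote(max_retries=3, runtime_env={"pip": ["requests==2.31.0"]})
def fetch_status():
    import requests
    return requests.get("https://example.com").status_code

print(ray.get(fetch_status.remote()))
```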

Would you like more detail on how to set these parameters or debug runtime_env issues?

