Raylet retries forever when the submitted job fails at runtime_env creation

Hi team,

I encountered an issue where a job submitted to the Ray cluster failed during runtime_env installation. The Raylet continuously retries the installation and ends up stuck indefinitely. As a result, the job remains pending and cannot be stopped.

I’m not sure if this is the intended behavior (which would be quite unexpected) or a problem specific to my setup. Is there a way to configure a maximum number of retry attempts?

Any insights or suggestions would be appreciated.

By default, Ray retries a failed runtime environment (runtime_env) installation up to 3 times for tasks. For job-level runtime_env setup, a failed installation should transition the job to FAILED rather than leave it stuck in PENDING. However, there have been reports of jobs getting stuck in PENDING when the runtime_env setup hangs or fails in certain ways, which also makes the job unresponsive to stop commands. This is not intended behavior and has been addressed in recent Ray releases (see this issue and fix).
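For context, here is a minimal sketch of where that setup step happens in the job submission flow, using the Jobs API. The dashboard address, entrypoint script, and pip dependency are placeholders, not taken from your setup:

```python
from ray.job_submission import JobSubmissionClient

# Address is a placeholder; point it at your cluster's dashboard/Jobs API.
client = JobSubmissionClient("http://127.0.0.1:8265")

# The runtime_env is installed on the cluster before the entrypoint runs;
# that installation step is where the failure described above occurs.
job_id = client.submit_job(
    entrypoint="python my_script.py",
    runtime_env={"pip": ["requests==2.31.0"]},
)

# With the fixes mentioned above, a failed setup should surface as FAILED
# rather than leaving the job in PENDING.
print(client.get_job_status(job_id))
```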

To control retries for tasks, use the max_retries argument in @ray.remote or set the RAY_TASK_MAX_RETRIES environment variable. For jobs, the max_retries field in the job config controls how many times the job is retried after a failure, but this does not directly affect runtime_env installation retries. If you are seeing jobs stuck in PENDING due to runtime_env failures, upgrading to Ray 2.49 or later is recommended, as this includes fixes to ensure jobs fail properly after a timeout or repeated setup failures (details, docs).
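As a sketch of the task-level knobs mentioned above (the function body and the pinned package are hypothetical examples, not from your job):

```python
import ray

ray.init()

# max_retries controls how many times Ray re-runs the *task* after a failure;
# per the note above it can also be set via the RAY_TASK_MAX_RETRIES
# environment variable, and it does not change runtime_env installation retries.
@ray.remote(max_retries=3, runtime_env={"pip": ["requests==2.31.0"]})
def fetch_status():
    import requests
    return requests.get("https://example.com").status_code

print(ray.get(fetch_status.remote()))
```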

Would you like more detail on how to set these parameters or debug runtime_env issues?

