Hi all,
I am running into what looks like a dependency race condition when running Ray on GKE using KubeRay and would appreciate any insight.
When I train a small toy model from the Ray documentation, the workers appear to begin executing before their Python environment has been fully set up. I am using uv for dependency management and setting working_dir to the project root so that the pyproject.toml is packaged and shipped to the Ray cluster.
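For reference, the driver side looks roughly like this (a simplified sketch; the cluster address and the training loop body are placeholders, not my exact code):

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Connect to the KubeRay cluster and ship the project root (including
# pyproject.toml) to the workers. The address below is a placeholder.
ray.init(
    address="ray://example-raycluster-head-svc:10001",
    runtime_env={"working_dir": "."},
)

def train_loop_per_worker():
    # toy training loop from the Ray docs, elided here
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1, use_gpu=False),
)
trainer.fit()
```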
The failures present in a few different ways. Sometimes workers fail with import errors such as ImportError: Can't import ray.train. Other times the failure is a missing shared library, most commonly a missing .so file for PyTorch. I also frequently see worker startup errors like:
The worker group startup timed out after 30.0 seconds waiting for 1 workers.
Increasing RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S helps in some cases. Setting it to around 10 minutes gives the workers enough time to finish setting up their environment before training starts. However, this has not been necessary on other clusters I have worked with that were not self-managed.
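For concreteness, the workaround amounts to something like this (whether the variable has to be set on the pods directly or can be passed through the job's runtime environment may depend on the setup; the 600-second value is just what worked for me):

```python
import ray

# Raise the Train worker group startup timeout so workers have time to
# finish installing the runtime environment. The value is illustrative.
ray.init(
    runtime_env={
        "working_dir": ".",
        "env_vars": {"RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S": "600"},
    }
)
```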
This workaround does not help when using Ray Data, where I see the same dependency-related failures.
For context, this is all running on the latest version of Ray, using the official Ray Docker images, and permissions and autoscaling appear to be configured correctly.
Has anyone encountered a similar issue on KubeRay and GKE? If so, what configuration changes or best practices helped ensure dependencies were fully available before worker startup?
Thanks in advance for any guidance.