Hi all,
I am running into what looks like a dependency race condition when running Ray on GKE using KubeRay and would appreciate any insight.
When I train a small toy model from the Ray documentation, the workers appear to begin executing before their Python environment has been fully set up. I am using uv for dependency management and setting working_dir to the project root so that the pyproject.toml is packaged and shipped to the Ray cluster.
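For reference, the driver side looks roughly like this (a simplified sketch; the cluster address and the training loop body are placeholders, not my exact code):

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Connect to the KubeRay cluster and ship the project root (including
# pyproject.toml) to the workers. The address below is a placeholder.
ray.init(
    address="ray://example-raycluster-head-svc:10001",
    runtime_env={"working_dir": "."},
)

def train_loop_per_worker():
    # toy training loop from the Ray docs, elided here
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1, use_gpu=False),
)
trainer.fit()
```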
The failures present in a few different ways. Sometimes workers fail with import errors such as ImportError: Can't import ray.train. Other times the failure is a missing shared library, most commonly a missing .so file for PyTorch. I also frequently see worker startup errors like:
The worker group startup timed out after 30.0 seconds waiting for 1 workers.
Increasing RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S helps in some cases. Setting it to around 10 minutes gives the workers enough time to finish setting up their environment before training starts. However, this has not been necessary on other clusters I have worked with that were not self-managed.
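For concreteness, the workaround amounts to something like this (whether the variable has to be set on the pods directly or can be passed through the job's runtime environment may depend on the setup; the 600-second value is just what worked for me):

```python
import ray

# Raise the Train worker group startup timeout so workers have time to
# finish installing the runtime environment. The value is illustrative.
ray.init(
    runtime_env={
        "working_dir": ".",
        "env_vars": {"RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S": "600"},
    }
)
```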
This workaround does not help when using Ray Data, where I see the same dependency-related failures.
For context, this is all running on the latest version of Ray, using the official Ray Docker images, and permissions and autoscaling appear to be configured correctly.
Has anyone encountered a similar issue on KubeRay and GKE? If so, what configuration changes or best practices helped ensure dependencies were fully available before worker startup?
Thanks in advance for any guidance.