Head pod stuck on pulling the image

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

When I create a RayCluster on OpenShift with a ServiceAccount set, I sometimes hit a race condition with the integrated OpenShift registry. OpenShift adds imagePullSecrets to the ServiceAccount, but KubeRay sometimes creates the head pod before those imagePullSecrets exist, and the pod then gets stuck pulling the image. Killing the head pod resolves it. Does anyone have ideas on how this could be fixed?

It seems you need some delay in the head pod startup. If the delay is predictable and short (say, less than 5 seconds), adding an init container or a lifecycle hook that runs sleep 5s should do the trick. If the delay is unpredictable and may be long, another option is to loop kubectl -n <namespace> get secret/<secret-name> in an init container for the head pod until the secret exists.
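For the polling variant, here is a rough sketch of what the loop could look like in Python. It is only an illustration: the function names (secret_exists, wait_for_secret), the timeout, and the polling interval are my own assumptions, not anything provided by KubeRay or OpenShift. It also assumes kubectl is on the PATH and that the account running it is allowed to read secrets in the namespace.

```python
import subprocess
import time


def secret_exists(namespace: str, secret_name: str) -> bool:
    """Return True if `kubectl -n <namespace> get secret/<secret-name>` succeeds."""
    cmd = ["kubectl", "-n", namespace, "get", f"secret/{secret_name}"]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def wait_for_secret(
    namespace: str,
    secret_name: str,
    timeout_s: float = 120.0,   # assumed upper bound; tune for your cluster
    interval_s: float = 2.0,    # assumed polling interval
    check=secret_exists,        # injectable so the loop can be tested offline
    sleep=time.sleep,
) -> bool:
    """Poll until the secret exists; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check(namespace, secret_name):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_for_secret("<namespace>", "<secret-name>") returns True once the secret shows up, or False after the timeout, so the caller can decide whether to fail or carry on. The same loop is just as easy to write as a few lines of shell if the init container image has no Python.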
