Unable to run Sample RayJob on EKS: No Available Workers

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

TL;DR: I am unable to run the sample RayJob (or any RayJob) because the Ray cluster starts but the job is never submitted. I believe this is due to no workers being available, but I am not sure how to fix that.

I am trying to follow this simple RayJob example from the docs.

Ray Version: 2.9.0
KubeRay Operator Version: 1.1.0
K8s Version: 1.29
Environment: AWS EKS

What I did:

  1. Set up my EKS cluster with two node groups, one of which has GPUs, following the instructions in the docs (see the node checks just after this list):
eksctl get nodegroup --cluster test_cluster
CLUSTER       NODEGROUP            STATUS  CREATED                MIN SIZE  MAX SIZE  DESIRED CAPACITY  INSTANCE TYPE  IMAGE ID                     TYPE
test_cluster  test_cpu_nodegroup   ACTIVE  2024-01-17T16:32:45Z   0         1         1                 m5.xlarge      AL2_x86_64                   managed
test_cluster  test_gpu_nodegroup   ACTIVE  2024-01-22T20:10:38Z   0         5         1                 g5.2xlarge     BOTTLEROCKET_x86_64_NVIDIA   managed
  2. Installed the KubeRay operator Helm chart, per Step 2 of the docs.
  3. Applied the sample RayJob per Step 3 of the docs:
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.1.0/ray-operator/config/samples/ray-job.sample.yaml
  4. Ran the commands in Step 4 of the docs:
kubectl get rayjob
NAME            AGE
rayjob-sample   22m

kubectl get raycluster
NAME                             DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
rayjob-sample-raycluster-fmw7d   1                                     ready    30m

kubectl get pods
NAME                                        READY   STATUS    RESTARTS   AGE
kuberay-operator-5488cc8c8c-45mt7           1/1     Running   0          74m
rayjob-sample-raycluster-fmw7d-head-2f7sk   1/1     Running   0          31m
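
For step 1, these are the node checks referenced above, to confirm that both node groups actually joined the cluster and that the GPU node advertises nvidia.com/gpu (a sketch; node names will differ per cluster):

kubectl get nodes -o wide
# On the GPU node, Allocatable should include nvidia.com/gpu
kubectl describe node <gpu-node-name> | grep -A 8 Allocatable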

In the kubectl get pods output above, you can see that no worker pods are ever created to run the actual job, even though the cluster comes up and reports ready. Does anyone have any advice or tips on how I can find out why this is the case?
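
In case it points at something obvious, the RayCluster's events seem like the natural place to look (resource names taken from the outputs above):

kubectl describe raycluster rayjob-sample-raycluster-fmw7d
# Pod-creation or scheduling failures should also show up as events:
kubectl get events --sort-by=.lastTimestamp | tail -20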

For what it's worth, I am able to run the same sample locally with Kind, so I am guessing I must be missing something in my EKS configuration.
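
To dig deeper, I pulled the KubeRay operator logs with something like the following (assuming the default Helm release name):

kubectl logs deploy/kuberay-operator --tail=200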

They contained the following error:

{
  "level": "error",
  "ts": "2024-05-13T21:01:12.397Z",
  "logger": "controllers.RayJob",
  "msg": "Failed to get job info",
  "RayJob": { "name": "rayjob-sample", "namespace": "default" },
  "reconcileID": "579c7e3e-5232-4432-a961-df3c3538d70b",
  "JobId": "rayjob-sample-n4lr2",
  "error": "Job rayjob-sample-n4lr2 does not exist on the cluster",
  "stacktrace": "github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayjob_controller.go:223\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"
}
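
If I read this right, the operator is asking the head node's dashboard API about a job that was never submitted. One way to check that from outside the cluster (assuming the default head service name and dashboard port 8265):

kubectl port-forward svc/rayjob-sample-raycluster-fmw7d-head-svc 8265:8265
# In a second shell, list the jobs the cluster knows about:
ray job list --address http://localhost:8265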

Downgrading the KubeRay operator from 1.1.0 → 1.0.0 fixed the issue. I have no idea why.
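
For anyone else who hits this, the downgrade was roughly the following (a sketch; assumes the Helm repo and release names from the quickstart docs):

helm uninstall kuberay-operator
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0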