How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
TL;DR: I am unable to run the sample RayJob (or any RayJob): the Ray cluster starts, but the job is never submitted. I believe this is due to no workers being available, but I am not sure how to fix that.
I am trying to follow this simple RayJob example from the docs.
Ray Version: 2.9.0
KubeRay Operator Version: 1.1.0
K8s Version: 1.29
Environment: AWS EKS
What I did:
- Set up my EKS cluster with 2 node groups, one of which has GPUs, following the instructions from the docs.
eksctl get nodegroup --cluster test_cluster
CLUSTER       NODEGROUP           STATUS  CREATED               MIN SIZE  MAX SIZE  DESIRED CAPACITY  INSTANCE TYPE                              TYPE
test_cluster  test_cpu_nodegroup  ACTIVE  2024-01-17T16:32:45Z  0         1         1                 m5.xlarge      AL2_x86_64                  managed
test_cluster  test_gpu_nodegroup  ACTIVE  2024-01-22T20:10:38Z  0         5         1                 g5.2xlarge     BOTTLEROCKET_x86_64_NVIDIA  managed
- Installed the KubeRay operator Helm chart, following Step 2 of the docs.
- Applied the sample RayJob according to the docs Step 3:
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.1.0/ray-operator/config/samples/ray-job.sample.yaml
- Ran the commands in step 4 of the docs:
kubectl get rayjob
NAME           AGE
rayjob-sample  22m
kubectl get raycluster
NAME                            DESIRED WORKERS  AVAILABLE WORKERS  STATUS  AGE
rayjob-sample-raycluster-fmw7d  1                                   ready   30m
kubectl get pods
NAME                                       READY  STATUS   RESTARTS  AGE
kuberay-operator-5488cc8c8c-45mt7          1/1    Running  0         74m
rayjob-sample-raycluster-fmw7d-head-2f7sk  1/1    Running  0         31m
As you can see, no workers are created to run the actual job despite the cluster starting. Does anyone have any advice or tips on how I can find out why this is the case?
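For reference, the usual places where a stalled RayJob leaves clues can be collected with a few standard kubectl commands. The sketch below wraps them in a helper. The function name `diagnose_rayjob` is hypothetical, the deployment name `kuberay-operator` is inferred from the operator pod name above, and everything is assumed to live in the current namespace. It is a starting point for investigation, not a known fix.

```shell
# Hypothetical helper (not part of KubeRay): gathers status, events, and logs
# that may explain why a RayJob is not progressing.
# Assumes the current kubectl context and namespace hold the RayJob.
diagnose_rayjob() {
  job="${1:-rayjob-sample}"   # RayJob name, as listed by `kubectl get rayjob`

  # RayJob status and events attached to the RayJob object.
  kubectl describe rayjob "$job"

  # RayCluster spec and status: compare desired vs. available workers.
  kubectl describe raycluster

  # All pods in the namespace, with the node each one landed on.
  kubectl get pods -o wide

  # Recent events, oldest first: pod creation or scheduling failures, if any.
  kubectl get events --sort-by=.lastTimestamp

  # Operator logs (deployment name assumed from the pod name above).
  kubectl logs deployment/kuberay-operator --tail=100
}
```

Running `diagnose_rayjob rayjob-sample` prints each report in turn. If a worker pod exists but is stuck in Pending, `kubectl describe pod <pod-name>` on it typically shows the scheduler's reason.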