I have installed a Ray cluster on EKS using the Ray Operator. The operator image is rayproject/ray:1.11.0-py38.
I have tried two setups. In the first, there is 1 head node and the workers are configured to autoscale from 0 to 50. In this setup no worker nodes are created and all the computation happens on the head node. Even when I set rayResource to 0, all the actors/jobs stay in the pending state. In the second, the workers are configured to autoscale from 1 to 50, and the same thing happens: no worker nodes are created. I checked the autoscaler logs and found the following:
Resources
---------------------------------------------------------------
Usage:
 0.0/2.0 CPU
 0.00/2.877 GiB memory
 0.00/0.673 GiB object_store_memory

Demands:
 {}: 2+ pending tasks/actors
When I use the image rayproject/ray:v1.11.0 instead, it works fine. However, that image ships Python 3.7, whereas the application I am working on requires Python 3.8.x.
I also tried replacing the Ray operator with KubeRay, but the issue is the same. I used the following container spec for the autoscaler, and the result did not change:
- name: autoscaler
  image: rayproject/ray:d3159f-py38
  imagePullPolicy: IfNotPresent
  env:
    - name: RAY_CLUSTER_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: RAY_CLUSTER_NAME
      value: prediction-ray-cluster
  command: ["ray"]
  args:
    - "kuberay-autoscaler"
    - "--cluster-name"
    - "$(RAY_CLUSTER_NAME)"
    - "--cluster-namespace"
    - "$(RAY_CLUSTER_NAMESPACE)"
  resources:
    limits:
      cpu: "500m"
      memory: "1024Mi"
    requests:
      cpu: "250m"
      memory: "512Mi"
  volumeMounts:
    - mountPath: /tmp/ray
      name: ray-logs
Am I missing some configuration? I have tried many variations, but I am stuck on how to proceed.
Thank you.