I create a new Ray cluster in Kubernetes with the following command:
helm install raycluster kuberay/ray-cluster --version 1.1.0 -f values.yaml --debug
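To confirm the cluster comes up, I check the pods the operator creates (assuming the default fullname for this release, raycluster-kuberay; KubeRay labels each pod with its cluster name):

kubectl get pods -l ray.io/cluster=raycluster-kuberay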
I have the following values.yaml file.
image:
  repository: rayproject/ray
  tag: 2.10.0
  pullPolicy: IfNotPresent

nameOverride: "kuberay"
fullnameOverride: ""

imagePullSecrets: []

common:
  containerEnv: {}

head:
  rayVersion: 2.10.0
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 20
    imagePullPolicy: Always
    securityContext: {}
    env: []
    envFrom: []
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  labels: {}
  serviceAccountName: ""
  rayStartParams:
    dashboard-host: '0.0.0.0'
    num-cpus: 0
  containerEnv: []
  envFrom: []
  resources:
    limits:
      cpu: "1"
      memory: "2G"
    requests:
      cpu: "1"
      memory: "2G"
  annotations: {}
  nodeSelector: {}
  tolerations:
    - effect: NoSchedule
      key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values:
                  - spot
  securityContext: {}
  volumes:
    - name: log-volume
      emptyDir: {}
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  sidecarContainers: []
  command: []
  args: []
  headService: {}

worker:
  groupName: workergroup
  replicas: 1
  minReplicas: 0
  maxReplicas: 3
  labels: {}
  serviceAccountName: ""
  rayStartParams:
    resources: '"{\"default-worker-group-node\": 1}"'
  containerEnv: []
  envFrom: []
  resources:
    limits:
      cpu: "1"
      memory: "1G"
    requests:
      cpu: "1"
      memory: "1G"
  annotations: {}
  nodeSelector: {}
  tolerations:
    - effect: NoSchedule
      key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values:
                  - spot
  securityContext: {}
  volumes:
    - name: log-volume
      emptyDir: {}
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  sidecarContainers: []
  command: []
  args: []

service:
  type: LoadBalancer
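As I understand it, KubeRay passes the worker rayStartParams through to the ray start command on each worker pod, so every worker should advertise exactly 1 unit of the custom resource, roughly like this (the exact quoting is the chart's, not mine):

ray start --resources='{"default-worker-group-node": 1}'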
I want to understand why the following code launches 2 worker pods. Each task requires only 0.2 units of the custom resource, while each worker pod starts with 1 unit of it (default-worker-group-node), so the two concurrent tasks need 0.4 units in total and should fit on a single pod. The same code finishes on a single pod when I decorate the tasks with @ray.remote(num_cpus=0.2) instead (see the variant after the script below). Is this the correct behaviour when using custom resources for scheduling?
import time

import ray

# Connect to the cluster through the Ray client port on the head service.
ray.init(address="ray://<ray-head-service-ip>:10001")
print(ray.cluster_resources())

# Each task asks for 0.2 units of the custom resource, so two concurrent
# tasks need 0.4 units; a single worker advertises 1 unit.
@ray.remote(resources={"default-worker-group-node": 0.2})
def hello_world():
    time.sleep(130)
    return "Local machine says hello to the remote cluster "

@ray.remote(resources={"default-worker-group-node": 0.2})
def hello_world_small():
    time.sleep(60)
    return "Local machine says hello to the remote cluster : hello_world_small"

start = time.time()
a = hello_world.remote()
time.sleep(30)  # the second task is submitted while the first is still running
b = hello_world_small.remote()
print(ray.get(a))
print(ray.get(b))
end = time.time()
print("Program ran for", end - start, "seconds")
ray.shutdown()
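For reference, the variant that completes on a single worker pod differs only in the decorators; everything else in the script is unchanged:

# Scheduling on fractional CPUs instead of the custom resource;
# with these decorators both tasks run on one worker pod.
@ray.remote(num_cpus=0.2)
def hello_world():
    time.sleep(130)
    return "Local machine says hello to the remote cluster "

@ray.remote(num_cpus=0.2)
def hello_world_small():
    time.sleep(60)
    return "Local machine says hello to the remote cluster : hello_world_small"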