1. Severity of the issue: High (completely blocks me).
2. Environment:
- Ray version: rayproject/ray:2.41.0
- Python version:
- OS: Ubuntu
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: As the load increases, autoscaling should kick in and the replica count should scale up from the minimum of 1 toward the maximum of 10.
- Actual: No scale-up of replicas happens, even when I send a large amount of traffic.
I am deploying multiplexed replicas.
ray_inference.py

from ray import serve


@serve.deployment(
    name="inference",
    ray_actor_options={"num_cpus": 0.2, "num_gpus": 0.1},
)
class Inference:
    def __init__(self):
        self.model_path = MODEL_DIR  # MODEL_DIR is defined elsewhere in the module

    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model_runtime(self, model_id: str):
        ...  # Loading the model for `model_id` from self.model_path

    async def __call__(self, request):
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model_runtime(model_id)
        ...  # Running inference with `model` and returning the response
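For completeness, this is roughly how the load is sent (a simplified sketch; the endpoint, payload, and model id below are placeholders). Multiplexed requests carry the serve_multiplexed_model_id header so Serve can route them to a replica that already has, or can load, that model:

import requests

resp = requests.post(
    "http://<ray-service-endpoint>/ray_inference",      # placeholder endpoint
    headers={"serve_multiplexed_model_id": "model_a"},   # placeholder model id
    json={"input": "..."},                               # placeholder payload
)
print(resp.status_code)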
Kubernetes YAML file (Helm template) used for the RayService:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: inference
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: inference_worker
        route_prefix: /ray_inference
        deployments:
          - name: inference
            num_replicas: auto
            max_ongoing_requests: 40
            ray_actor_options:
              num_cpus: 0.1
              num_gpus: 0.1
            autoscaling_config:
              min_replicas: 1
              max_replicas: 10
              target_ongoing_requests: 30
  rayClusterConfig:
    rayVersion: '2.41.0' # Should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: <custom image imported from ray>
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: {{ .Values.rayservice.containerPort }}
                  name: serve
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: 2
                  memory: "4G"
                requests:
                  cpu: 2
                  memory: "4G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      # The pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 2
        groupName: wgpu
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker
                image: <custom image imported from ray>
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: 4
                    memory: "24G"
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 4
                    memory: "24G"
                    nvidia.com/gpu: 1
                readinessProbe:
                  exec:
                    command:
                      - bash
                      - -c
                      - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
                  initialDelaySeconds: 10
                  timeoutSeconds: 2
                  periodSeconds: 5
                  failureThreshold: 5
                livenessProbe:
                  exec:
                    command:
                      - bash
                      - -c
                      - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
                  initialDelaySeconds: 30
                  timeoutSeconds: 2
                  periodSeconds: 5
                  failureThreshold: 5
            # Please add the following taints to the GPU node.
            tolerations:
              - key: "ray.io/node-type"
                operator: "Equal"
                value: "worker"
                effect: "NoSchedule"
              - key: "nvidia.com/gpu"
                operator: Exists
              - key: "pool"
                value: "gpu"
                operator: "Equal"
                effect: "NoSchedule"
              - key: "type"
                value: "{{ .Values.rayinference.gputype | default "a10g" }}"
                operator: "Equal"
                effect: "NoSchedule"
            nodeSelector:
              "pool": "gpu"
              "type": {{ .Values.rayinference.gputype | default "a10g" }}
            volumes:
              - name: {{ .Chart.Name }}-global
                configMap:
                  name: {{ .Release.Name }}-configmap
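For reference, this is the behaviour I expect from the autoscaling_config above (a back-of-the-envelope sketch of my understanding, not the actual autoscaler implementation): with target_ongoing_requests: 30, the desired replica count should be roughly the total number of ongoing requests divided by 30, clamped between min_replicas and max_replicas.

import math

def expected_replicas(total_ongoing_requests: int,
                      target_ongoing_requests: int = 30,
                      min_replicas: int = 1,
                      max_replicas: int = 10) -> int:
    # Rough mental model: one replica per `target_ongoing_requests` in-flight requests.
    desired = math.ceil(total_ongoing_requests / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, desired))

print(expected_replicas(60))   # -> 2: even moderate load should add a replica
print(expected_replicas(300))  # -> 10: heavy load should reach max_replicas

In practice the deployment stays at 1 replica no matter how much load I send.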
I have tried different configurations for max_ongoing_requests and target_ongoing_requests (keeping the target lower than the max), but the behaviour has been the same.
As seen in the Ray dashboard screenshot, there are warnings claiming that resources are not available, but the resource status says otherwise.
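For what it's worth, this is how I cross-check the dashboard from inside the cluster (a sketch; serve.status() simply dumps the Serve controller's view of the application and its deployments):

import ray
from ray import serve

ray.init(address="auto")  # attach to the running cluster, e.g. from the head pod
print(serve.status())     # controller's view of applications, deployments, and replica states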