Overriding resources per worker in ray-llm


Looking to experiment with ray-llm, I created a Ray cluster with a pod template that refers to the ray-llm image:

      #pod template
          - name: ray-head
            image: anyscale/ray-llm:latest

However, I noticed that this image already comes with a particular accelerator type configured.

Since I have a different accelerator, I’d like to change that parameter. Is it possible to do so using rayStartParams, or do I need to rebuild ray-llm with a different set of params?

I’m using the RayService CRD with KubeRay.

You’d need to do two things: 1. set the correct resources via rayStartParams; 2. update the model YAML to point to the new resources you want to use. You can pass the updated YAML in either through the Docker image or through runtime environments.
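For the second step, the accelerator is typically referenced in the model’s scaling configuration. A rough sketch of the relevant fragment (key names based on the ray-llm model YAML format; the exact schema may differ between versions, so check the YAML shipped in your image):

```yaml
# Fragment of a ray-llm model YAML (keys assumed; verify against your version).
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  resources_per_worker:
    # Change this to the accelerator resource your nodes actually advertise,
    # e.g. accelerator_type_a100 instead of accelerator_type_a10.
    accelerator_type_a10: 0.01
```

The resource name here must match the custom resource advertised by your worker nodes via rayStartParams, otherwise placement groups will stay pending.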

Do you have an example I can look at?

    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 0
      maxReplicas: 1
      # logical group name, for this called gpu-group, also can be functional
      groupName: gpu-group
      rayStartParams:
        num-gpus: "4"
        resources: '"{\"accelerator_type_xxx\": 2}"'

This didn’t work.

Also, will ray stop stop the entire cluster? Is it possible to stop just one job and restart it with a different set of parameters from the head node, as opposed to doing it in the KubeRay CRD/pod template?

Hi @asharma, your template looks correct! Can you paste the error you got? Also, SSH into the pod and paste the output of ray status.

    Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 9.0, 'accelerator_type_a10': 0.02, 'GPU': 1.0}). Add suitable node types to this cluster.

That’s the error I see on the ray dashboard.

    $ ray status
    ======== Autoscaler status: 2024-01-25 11:18:33.165916 ========
    Node status
    ---------------------------------------------------------------
    Healthy:
     1 node_e247692229da368a65ec11cda51ed67cb797042d033088513145c768
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)

    Resources
    ---------------------------------------------------------------
    Usage:
     2.0/2.0 CPU
     0.0/2.0 accelerator_type_cpu
     0B/8.00GiB memory
     44B/2.17GiB object_store_memory

    Demands:
     {'CPU': 1.0, 'accelerator_type_a10': 0.01}: 1+ pending tasks/actors (1+ using placement groups)
     {'accelerator_type_a10': 0.01, 'CPU': 1.0} * 1, {'CPU': 8.0, 'accelerator_type_a10': 0.01, 'GPU': 1.0} * 1 (STRICT_PACK): 1+ pending placement groups
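Reading the Demands section: the pending placement groups request accelerator_type_a10, while the only node type in the cluster advertises accelerator_type_cpu. A worker group would need to advertise the a10 resource explicitly for these demands to be satisfiable; a sketch in the KubeRay rayStartParams style (values illustrative):

```yaml
# Sketch: worker group advertising the accelerator_type_a10 custom resource
# (replica counts and GPU count illustrative).
- groupName: gpu-group
  replicas: 1
  rayStartParams:
    num-gpus: "1"
    # The resource name must match what the application requests.
    resources: '"{\"accelerator_type_a10\": 1}"'
```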

On the head node:

    (base) ray@ray-llm-raycluster-9s78h-head-8m4lw:~$ ray health-check --address ray-llm-raycluster-9s78h-head-svc.default.svc.cluster.local:6379
    (base) ray@ray-llm-raycluster-9s78h-head-8m4lw:~$ echo $?

But the GPU worker machine is stuck in “waiting_for_gcs” state. It seems to be the same problem described here:


Even if I work around the init container problem, it’s not clear how I can fix the a10 GPU type mismatch. Do you have instructions for stopping a Ray application and restarting it with a different parameter/gpu_type, without having to stop/restart the Ray cluster via KubeRay?

@Sihan_Wang I was able to add a GPU worker node to the cluster. The only issue is that it’s not an a10.

My question is:

How do I stop the existing application “router” and start it with the right GPU type? Is it possible to edit the deployment config of an application in the “DEPLOYING” state?

Stopping the application and restarting with a different parameter is also fine. But I can’t seem to find docs on how to do that from the head node using the CLI.

Please consider this resolved. The sequence I was looking for:

    serve shutdown
    edit <deployment config>
    serve start

I was looking for this functionality in the ray CLI, but it’s in a separate tool (the serve CLI).

Would love to be able to do this on a per application basis, instead of shutting down all of them.
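For anyone landing here later: newer Ray versions support multi-application Serve config files, which should let you redeploy one application at a time by editing its entry and re-running serve deploy, rather than shutting everything down. A hedged sketch (application names and import paths are illustrative, not from ray-llm):

```yaml
# Sketch of a multi-application Serve config (names/paths hypothetical).
# Editing one application's entry and re-running `serve deploy config.yaml`
# should update only the changed application.
applications:
  - name: router
    import_path: my_module:app      # hypothetical import path
    route_prefix: /router
  - name: other_app
    import_path: other_module:app   # hypothetical import path
    route_prefix: /other
```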