Overriding resources per worker in ray-llm

Hello:

Looking to experiment with ray-llm, I created a Ray cluster with a pod template that references the ray-llm image:

      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: anyscale/ray-llm:latest

However, I noticed that this image already has a particular accelerator type configured here.

Since I have a different accelerator, I’d like to change that parameter. Is it possible to do so with rayStartParams, or do I need to rebuild the ray-llm image with a different set of parameters?

I’m using the RayService CRD with KubeRay.

You’d need to do two things: 1) set the correct custom resources in rayStartParams, and 2) update the model YAML to point to the new resources you want to use. You can pass the updated YAML in either through the Docker image or through runtime environments.
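For the second part, the section of the model YAML to edit looks roughly like the sketch below. The accelerator_type_t4 name and the worker counts are just placeholders; the resource name has to match whatever custom resource your worker group advertises.

    # sketch of the scaling section of a ray-llm model YAML (values illustrative)
    scaling_config:
      num_workers: 1
      num_gpus_per_worker: 1
      num_cpus_per_worker: 8
      placement_strategy: "STRICT_PACK"
      resources_per_worker:
        # must match the custom resource set in rayStartParams on the workers
        accelerator_type_t4: 0.01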

Do you have an example I can look at?

    workerGroupSpecs:
    # the pod replicas in this worker group
    - replicas: 1
      minReplicas: 0
      maxReplicas: 1
      # logical group name; can be any descriptive name
      groupName: gpu-group
      rayStartParams:
        num-gpus: "4"
        resources: '"{\"accelerator_type_xxx\": 2}"'

This didn’t work.

Also, does ray stop bring down the entire cluster? Is it possible to stop just one job and restart it with a different set of parameters from the head node, as opposed to doing it via the KubeRay CRD/pod template?

Hi @asharma, your template looks correct! Can you paste the error you got? Also, could you SSH into the pod and paste the output of ray status?
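If SSH isn’t set up on the pod, kubectl exec works too; the pod name below is just a placeholder:

    kubectl exec -it <head-or-worker-pod-name> -- ray status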

    Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 9.0, 'accelerator_type_a10': 0.02, 'GPU': 1.0}). Add suitable node types to this cluster

That’s the error I see on the ray dashboard.


    $ ray status
    ======== Autoscaler status: 2024-01-25 11:18:33.165916 ========
    Node status
    ---------------------------------------------------------------
    Active:
     1 node_e247692229da368a65ec11cda51ed67cb797042d033088513145c768
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)

    Resources
    ---------------------------------------------------------------
    Usage:
     2.0/2.0 CPU
     0.0/2.0 accelerator_type_cpu
     0B/8.00GiB memory
     44B/2.17GiB object_store_memory

    Demands:
     {'CPU': 1.0, 'accelerator_type_a10': 0.01}: 1+ pending tasks/actors (1+ using placement groups)
     {'accelerator_type_a10': 0.01, 'CPU': 1.0} * 1, {'CPU': 8.0, 'accelerator_type_a10': 0.01, 'GPU': 1.0} * 1 (STRICT_PACK): 1+ pending placement groups

On the head node:

    (base) ray@ray-llm-raycluster-9s78h-head-8m4lw:~$ ray health-check --address ray-llm-raycluster-9s78h-head-svc.default.svc.cluster.local:6379
    (base) ray@ray-llm-raycluster-9s78h-head-8m4lw:~$ echo $?
    0

But the GPU worker machine is stuck in the “waiting_for_gcs” state. It seems to be the same problem described here:

https://docs.ray.io/en/latest/cluster/kubernetes/troubleshooting/troubleshooting.html#disable-the-init-container-injection
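(The workaround described there, if I read it correctly, is to disable init container injection on the KubeRay operator; a rough sketch, assuming the default operator deployment name, is:)

    # sketch: disable init container injection on the KubeRay operator
    # (deployment name is the chart default; adjust to your install)
    kubectl set env deployment/kuberay-operator ENABLE_INIT_CONTAINER_INJECTION=false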

Even if I work around the init container problem, it’s not clear how I can fix the a10 GPU type mismatch. Are there instructions for stopping a Ray application and restarting it with a different parameter/gpu_type, without having to stop and restart the Ray cluster via KubeRay?

@Sihan_Wang I was able to add a GPU worker node to the cluster. The only issue is that it’s not an a10.

My question is:

How do I stop the existing application “router” and start it with the right GPU type? Is it possible to edit the deployment config of an application in the “DEPLOYING” state?

Stopping the application and restarting with a different parameter is also fine. But I can’t seem to find docs on how to do that from the head node using the CLI.

Please consider this resolved. The sequence I was looking for:

    serve shutdown
    edit <deployment config>
    serve start
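Spelled out with a Serve config file (the file name is just an example), the same loop can also be done with serve deploy instead of serve start:

    # stop all running Serve applications
    serve shutdown -y
    # edit the accelerator type / resources in the config file
    vim serve-config.yaml
    # redeploy the applications from the updated config
    serve deploy serve-config.yaml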

I was looking for this functionality in the ray CLI, but it lives in the separate serve CLI.

I would love to be able to do this on a per-application basis, instead of shutting down all of them.