However, I noticed that this image already has a particular accelerator type configured here.
Since I have a different accelerator, I’d like to change that parameter. Is that possible using rayStartParams, or do I need to rebuild the ray-llm image with a different set of params?
You’d need to do two things: (1) set the correct resources in the ray start parameters, and (2) update the YAML to point to the new resources you want to use. You can pass in the updated YAML either through the Docker image or through runtime environments.
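As a rough illustration of step 2, here is a minimal sketch that renames a custom resource key everywhere in a model config once it has been loaded into a dict. The config layout (deployment_config, scaling_config, resources_per_worker) and both resource names are assumptions made for the example, not taken from your actual file, and rename_resource is a hypothetical helper.

```python
from typing import Any


def rename_resource(config: Any, old: str, new: str) -> Any:
    """Return a copy of `config` with every dict key `old` renamed to `new`."""
    if isinstance(config, dict):
        return {
            (new if key == old else key): rename_resource(value, old, new)
            for key, value in config.items()
        }
    if isinstance(config, list):
        return [rename_resource(item, old, new) for item in config]
    return config


# Hypothetical fragment of a model config; the real file may be laid out differently.
model_config = {
    "deployment_config": {
        "ray_actor_options": {"resources": {"accelerator_type_a10": 0.01}},
    },
    "scaling_config": {
        "num_gpus_per_worker": 1,
        "resources_per_worker": {"accelerator_type_a10": 0.01},
    },
}

# "accelerator_type_xxx" stands in for whatever resource name the workers advertise.
updated = rename_resource(model_config, "accelerator_type_a10", "accelerator_type_xxx")
print(updated["scaling_config"]["resources_per_worker"])
```

The new name has to match, character for character, the custom resource that the worker nodes register at startup; otherwise the scheduler still sees no node that can satisfy the request.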
workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 1
    minReplicas: 0
    maxReplicas: 1
    # logical group name, for this called small-group, also can be functional
    groupName: gpu-group
    rayStartParams:
      num-gpus: "4"
      resources: '"{\"accelerator_type_xxx\": 2}"'
This didn’t work.
Also, does ray stop shut down the entire cluster? Is it possible to stop just one job and restart it with a different set of parameters from the head node, as opposed to doing it in the KubeRay CRD/pod template?
Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 9.0, 'accelerator_type_a10': 0.02, 'GPU': 1.0}). Add suitable node types to this cluster
That’s the error I see on the ray dashboard.
$ ray status
======== Autoscaler status: 2024-01-25 11:18:33.165916 ========
Node status
---------------------------------------------------------------
Active:
1 node_e247692229da368a65ec11cda51ed67cb797042d033088513145c768
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
2.0/2.0 CPU
0.0/2.0 accelerator_type_cpu
0B/8.00GiB memory
44B/2.17GiB object_store_memory
Demands:
{'CPU': 1.0, 'accelerator_type_a10': 0.01}: 1+ pending tasks/actors (1+ using placement groups)
{'accelerator_type_a10': 0.01, 'CPU': 1.0} * 1, {'CPU': 8.0, 'accelerator_type_a10': 0.01, 'GPU': 1.0} * 1 (STRICT_PACK): 1+ pending placement groups
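To make the status output above easier to read, here is a simplified sketch of the feasibility check. It is not Ray's actual scheduler logic. It only sums the two STRICT_PACK bundles (which must fit on one node) and compares the total with a node's resources. The CPU node values are copied from the Usage section; the GPU worker is a hypothetical node with a different accelerator label.

```python
from collections import defaultdict
from typing import Dict, List


def total_demand(bundles: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum placement-group bundles; STRICT_PACK needs the total on a single node."""
    total: Dict[str, float] = defaultdict(float)
    for bundle in bundles:
        for name, amount in bundle.items():
            total[name] += amount
    return dict(total)


def missing_resources(demand: Dict[str, float], node: Dict[str, float]) -> Dict[str, float]:
    """Return the requested resources that the node cannot supply in full."""
    return {name: amount for name, amount in demand.items() if node.get(name, 0.0) < amount}


# Bundles as listed under "Demands" in the status output.
bundles = [
    {"accelerator_type_a10": 0.01, "CPU": 1.0},
    {"CPU": 8.0, "accelerator_type_a10": 0.01, "GPU": 1.0},
]
demand = total_demand(bundles)

# The only active node, per the "Usage" section.
cpu_node = {"CPU": 2.0, "accelerator_type_cpu": 2.0}
# Hypothetical GPU worker that advertises a different accelerator type.
gpu_node = {"CPU": 16.0, "GPU": 4.0, "accelerator_type_xxx": 2.0}

print(missing_resources(demand, cpu_node))
print(missing_resources(demand, gpu_node))
```

Under these assumptions the GPU worker satisfies the CPU and GPU parts of the request, and the only thing left unmet is the accelerator_type_a10 label, which is why adding a non-A10 node alone does not clear the pending placement group.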
Even if I work around the init container problem, it’s not clear how I can fix the GPU type a10 mismatch. Do you have instructions to stop a ray application and restart it with a different parameter/gpu_type without having to stop/restart a ray cluster via kuberay?
@Sihan_Wang I was able to add a GPU worker node to the cluster. The only issue is that it’s not an A10.
My question is:
How do I stop the existing application “router” and start it with the right GPU type? Is it possible to edit the deployment config of an application in the “DEPLOYING” state?
Stopping the application and restarting with a different parameter is also fine. But I can’t seem to find docs on how to do that from the head node using the CLI.
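One possible approach, sketched under the assumption that the installed Ray version exposes serve.delete and serve.status in its Python API: delete only the Serve application from the head node and leave the Ray cluster running. The helper below (delete_serve_app is a hypothetical name) takes the serve module as an argument so the logic can be exercised without a cluster.

```python
from typing import Any, List


def delete_serve_app(serve_api: Any, app_name: str) -> List[str]:
    """Delete one Serve application and return the names still deployed.

    `serve_api` is expected to be the `ray.serve` module (an assumption of this
    sketch); only its `delete` and `status` functions are used.
    """
    serve_api.delete(app_name)
    return sorted(serve_api.status().applications)


# On the head node (requires Ray Serve, not executed here):
#   from ray import serve
#   delete_serve_app(serve, "router")
#
# The application can then be started again from an updated config,
# for example with: serve run <updated config file>
```

From the shell, serve shutdown should have a similar effect: it removes the Serve applications but does not stop the Ray nodes, unlike ray stop, which stops the Ray processes on the machine where it runs.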