I'm trying to run this KubeRay Serve example. However, with the default settings (ray-service.vllm.yaml), this error occurs on deployment:
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Quadro RTX 5000 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
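For context, here is a minimal sketch of where that flag lands in vLLM's Python API, assuming vllm==0.5.4 as pinned in the config below (the model name is the one from the sample):

```python
from vllm.engine.arg_utils import AsyncEngineArgs

# "half" (float16) avoids bfloat16, which requires compute capability >= 8.0
# (Ampere or newer); the Quadro RTX 5000 is Turing, capability 7.5.
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dtype="half",
)
```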
What I tried was to set this argument through the Serve config, like this:
```yaml
spec:
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /
        import_path: ray-operator.config.samples.vllm.serve:model
        args:
          dtype: "float16"
        deployments:
          - name: VLLMDeployment
            num_replicas: 1
            ray_actor_options:
              num_cpus: 6
              # NOTE: num_gpus is set automatically based on TENSOR_PARALLELISM
        runtime_env:
          working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
          pip: ["vllm==0.5.4"]
          env_vars:
            MODEL_ID: "meta-llama/Meta-Llama-3-8B-Instruct"
            TENSOR_PARALLELISM: "2"
            PIPELINE_PARALLELISM: "1"
```
which results in this error instead:
ValueError: Arguments can only be passed to an application builder function, not an already built application.
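From that message I assume the sample's serve.py builds the application at module import time, so `import_path: ...serve:model` points at an already built application rather than a builder function that Serve could pass `args:` to. A minimal sketch of the distinction as I understand it (names hypothetical, not the sample's actual code):

```python
from typing import Dict
from ray import serve


@serve.deployment
class Echo:
    def __init__(self, dtype: str):
        self.dtype = dtype

    async def __call__(self, request):
        return {"dtype": self.dtype}


def build_app(cli_args: Dict[str, str]) -> serve.Application:
    # Builder function: Serve calls it with the `args:` mapping from the
    # config, so `dtype: "float16"` would arrive here as cli_args["dtype"].
    return Echo.bind(cli_args.get("dtype", "auto"))


# Module-level *built* application: if import_path points here, Serve can
# no longer inject `args:` -- hence the ValueError above.
model = build_app({"dtype": "auto"})
```

If that is what the sample does, `args:` would presumably only work with an import_path targeting the builder function itself.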
How do I run the example on weaker GPUs?