I'm trying to run this KubeRay Serve example. However, with the default settings (ray-service.vllm.yaml), this error occurs on deployment:
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Quadro RTX 5000 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
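For context, here is a minimal sketch of where that flag lands in vLLM's Python API, assuming vllm==0.5.4 as pinned in the config below (the model name is the one from the sample):

```python
from vllm.engine.arg_utils import AsyncEngineArgs

# "half" (float16) avoids bfloat16, which requires compute capability >= 8.0
# (Ampere or newer); the Quadro RTX 5000 is Turing, capability 7.5.
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dtype="half",
)
```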
What I tried was to set this argument through the Serve config, like this:
```yaml
spec:
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /
        import_path: ray-operator.config.samples.vllm.serve:model
        args:
          dtype: "float16"
        deployments:
          - name: VLLMDeployment
            num_replicas: 1
            ray_actor_options:
              num_cpus: 6
              # NOTE: num_gpus is set automatically based on TENSOR_PARALLELISM
        runtime_env:
          working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
          pip: ["vllm==0.5.4"]
          env_vars:
            MODEL_ID: "meta-llama/Meta-Llama-3-8B-Instruct"
            TENSOR_PARALLELISM: "2"
            PIPELINE_PARALLELISM: "1"
```
which results in this error instead:
ValueError: Arguments can only be passed to an application builder function, not an already built application.
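From that message I assume the sample's serve.py builds the application at module import time, so `import_path: ...serve:model` points at an already built application rather than a builder function that Serve could pass `args:` to. A minimal sketch of the distinction as I understand it (names hypothetical, not the sample's actual code):

```python
from typing import Dict
from ray import serve


@serve.deployment
class Echo:
    def __init__(self, dtype: str):
        self.dtype = dtype

    async def __call__(self, request):
        return {"dtype": self.dtype}


def build_app(cli_args: Dict[str, str]) -> serve.Application:
    # Builder function: Serve calls it with the `args:` mapping from the
    # config, so `dtype: "float16"` would arrive here as cli_args["dtype"].
    return Echo.bind(cli_args.get("dtype", "auto"))


# Module-level *built* application: if import_path points here, Serve can
# no longer inject `args:` -- hence the ValueError above.
model = build_app({"dtype": "auto"})
```

If that is what the sample does, `args:` would presumably only work with an import_path targeting the builder function itself.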
How do I run the example on weaker GPUs?