1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity, but I can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.40
- Python version: 3.11
- OS: Ubuntu 22.04
I am using vLLM + Volcano + Ray to build a distributed LLM inference service. There are two nodes in my Kubernetes cluster (named nodeA and nodeB). When I start the service, nodeA runs normally, but nodeB reports the following error:

```
nvidia-container-cli: device error: GPU-xxxx0cde: unknown device: unknown
```

After checking, the GPU UUID in the error belongs to a GPU on nodeA, not nodeB. How can I solve this problem?
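For reference, this is roughly how the check can be done (a minimal sketch; `<worker-pod-on-nodeB>` is a placeholder for the failing worker pod's name, and I am assuming the vGPU plugin hands the assigned device to the container via an injected `NVIDIA_VISIBLE_DEVICES` env var):

```shell
# List the GPU UUIDs each node actually exposes (run on nodeA and nodeB)
nvidia-smi -L

# Inspect the env vars in the failing worker's container spec;
# the UUID handed to the container should show up here
kubectl -n deployment-system get pod <worker-pod-on-nodeB> \
  -o jsonpath='{.spec.containers[0].env}'
```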
Here is my YAML file:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: qwen25-05b
  namespace: deployment-system
spec:
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /
        import_path: kuberay.ray-operator.config.samples.vllm.serve:model
        deployments:
          - name: VLLMDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 2
        runtime_env:
          env_vars:
            MODEL_ID: "/model"
            TENSOR_PARALLELISM: "1"
            PIPELINE_PARALLELISM: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        disable-usage-stats: "true"
        no-monitor: "true"
      template:
        metadata:
          annotations:
            scheduling.volcano.sh/queue-name: model-deploy
        spec:
          schedulerName: volcano
          volumes:
            - name: model
              nfs:
                server: nodeA
                path: /nfs/public/model/
            - name: kuberay
              nfs:
                server: nodeA
                path: /nfs/kuberay/kuberay-master
          containers:
            - name: ray-head
              image: docker.io/library/ray:2.40.0-torch2.5.1-cuda12.1-patch-fix
              resources:
                limits:
                  cpu: "2"
                  memory: "4Gi"
                requests:
                  cpu: "2"
                  memory: "4Gi"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - name: model
                  mountPath: /model
                - name: kuberay
                  mountPath: /kuberay
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: gpu-group
        rayStartParams:
          num-cpus: "2"
          num-gpus: "1"
          resources: '"{\"vgpu-memory\": 3000}"'
          disable-usage-stats: "true"
          no-monitor: "true"
        template:
          metadata:
            annotations:
              scheduling.volcano.sh/queue-name: model-deploy
          spec:
            schedulerName: volcano
            volumes:
              - name: model
                nfs:
                  server: nodeA
                  path: /nfs/public/model/
              - name: kuberay
                nfs:
                  server: nodeA
                  path: /nfs/kuberay/kuberay-master
            containers:
              - name: llm
                image: docker.io/library/ray:2.40.0-torch2.5.1-cuda12.1-patch-fix
                resources:
                  limits:
                    cpu: 2
                    memory: "4Gi"
                    volcano.sh/vgpu-number: 1
                    volcano.sh/vgpu-memory: 3000
                  requests:
                    cpu: "2"
                    memory: "4Gi"
                    volcano.sh/vgpu-number: 1
                    volcano.sh/vgpu-memory: 3000
                volumeMounts:
                  - name: model
                    mountPath: /model
                  - name: kuberay
                    mountPath: /kuberay
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: model-deploy
spec:
  weight: 1
  capability:
    cpu: 6
    memory: 12Gi
    volcano.sh/vgpu-number: 2
    volcano.sh/vgpu-memory: 6000
```
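In case it helps with diagnosis, a quick sketch of two checks that should narrow this down (namespace and node names as in the manifest above):

```shell
# Confirm nodeB actually advertises the Volcano vGPU extended resources
kubectl describe node nodeB | grep -i vgpu

# See which node each Ray worker pod was scheduled onto
kubectl -n deployment-system get pods -o wide
```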