Ray cluster details don't show the requested number of GPUs

The Ray cluster dashboard only shows 2 GPUs even though I have requested 4 GPUs for each worker node. Running nvidia-smi inside a worker node shows 4 GPUs. What am I missing here?

# This is a RayCluster configuration for PyTorch image training benchmark with a 1Gi training set. 
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  rayVersion: 'latest'
  ######################headGroupSpec#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headgroup
    # the following params are used to complete the ray start: ray start --head --block ...
    rayStartParams:
      dashboard-host: '0.0.0.0'
    #pod template
    template:
      spec:
        containers:
        # The Ray head pod
        - name: ray-head
          image: rayproject/ray-ml:2.12.0.c2a961-cpu
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "4"
              memory: "24G"
            requests:
              cpu: "4"
              memory: "12G"
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 2
    minReplicas: 1
    maxReplicas: 300
    # logical group name, for this called small-group, also can be functional
    groupName: small-group
    rayStartParams:
      num-gpus: "1"
    #pod template
    template:
      metadata:
        labels:
          key: value
        # annotations for pod
        annotations:
          key: value
      spec:
        containers:
        - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
          image: rayproject/ray-ml:2.12.0.c2a961-gpu
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "8"
              memory: "24G"
              nvidia.com/gpu: 4
            requests:
              cpu: "4"
              memory: "12G"
              nvidia.com/gpu: 4

A job with a scaling config that requires more than 2 GPUs doesn't get placed

Can you post a screenshot of the Ray cluster, as well as how you configured your resources in Python?

My understanding is that by default each worker gets 1 GPU if use_gpu is set to True. I have tried adding more than 2 workers, and I have also set GPUs to 2 (via resources_per_worker) with 2 workers, but both scenarios fail: the job doesn't get placed.

Python resource config:

# Imports needed for this snippet (Ray Train API).
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# use_gpu, train_loop_per_worker, and train_dataset are defined earlier in the script.
scaling_config = ScalingConfig(num_workers=2, use_gpu=use_gpu, resources_per_worker={"CPU": 3})
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
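
For reference, the 2-GPUs-per-worker variant would look roughly like this (a sketch using the same imports, train_loop_per_worker, and train_dataset as above; the exact resource numbers are illustrative):

# Same imports as above: ray.train.ScalingConfig, ray.train.torch.TorchTrainer.
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"CPU": 3, "GPU": 2},  # request 2 GPUs per training worker
)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)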

The RayCluster you shared specifies num-gpus: "1" in the workerGroupSpecs[0].rayStartParams field. This is likely why the Ray dashboard only shows 2 GPUs (2 replicas with 1 GPU each).

In general, when using KubeRay you don't need to specify num-gpus at all: KubeRay will automatically set it to the correct value based on the container's GPU resource requests/limits (nvidia.com/gpu in your config).
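
For example, the worker group from your config could drop num-gpus and let KubeRay infer it from the nvidia.com/gpu requests/limits. This is a sketch trimmed to the relevant fields, not a complete manifest:

workerGroupSpecs:
- replicas: 2
  minReplicas: 1
  maxReplicas: 300
  groupName: small-group
  # No num-gpus here: KubeRay derives it from the nvidia.com/gpu resource below.
  rayStartParams: {}
  template:
    spec:
      containers:
      - name: machine-learning
        image: rayproject/ray-ml:2.12.0.c2a961-gpu
        resources:
          limits:
            cpu: "8"
            memory: "24G"
            nvidia.com/gpu: 4
          requests:
            cpu: "4"
            memory: "12G"
            nvidia.com/gpu: 4

With 2 replicas at 4 GPUs each, the dashboard should then report 8 GPUs for the cluster, and a job asking for more than 2 GPUs (e.g. 2 GPUs per training worker) should be able to schedule.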