Ray cluster details don't show the requested number of GPUs

The Ray cluster dashboard only shows 2 GPUs even though I have requested 4 GPUs for each worker node. Running nvidia-smi inside a worker node shows 4 GPUs. What am I missing here?

# This is a RayCluster configuration for PyTorch image training benchmark with a 1Gi training set. 
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  rayVersion: 'latest'
  ######################headGroupSpec#################################
  # head group template and specs, (perhaps 'group' is not needed in the name)
  headGroupSpec:
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headgroup
    # the following params are used to complete the ray start: ray start --head --block ...
    rayStartParams:
      dashboard-host: '0.0.0.0'
    #pod template
    template:
      spec:
        containers:
        # The Ray head pod
        - name: ray-head
          image: rayproject/ray-ml:2.12.0.c2a961-cpu
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "4"
              memory: "24G"
            requests:
              cpu: "4"
              memory: "12G"
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 2
    minReplicas: 1
    maxReplicas: 300
    # logical group name, for this called small-group, also can be functional
    groupName: small-group
    rayStartParams:
      num-gpus: "1"
    #pod template
    template:
      metadata:
        labels:
          key: value
        # annotations for pod
        annotations:
          key: value
      spec:
        containers:
        - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
          image: rayproject/ray-ml:2.12.0.c2a961-gpu
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "8"
              memory: "24G"
              nvidia.com/gpu: 4
            requests:
              cpu: "4"
              memory: "12G"
              nvidia.com/gpu: 4

A job with a scaling config that requires more than 2 GPUs doesn't get placed

Can you post a screenshot of the Ray cluster, as well as how you configured your resources in Python?

My understanding is that by default each worker gets 1 GPU if use_gpu is set to True. I have tried adding more than 2 workers, and I have also set GPUs to 2 (via resources_per_worker) with 2 workers, but both scenarios fail: the job doesn't get placed.

Python resource config:

# Imports needed for this snippet (Ray Train API).
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# use_gpu, train_loop_per_worker, and train_dataset are defined earlier in the script.
scaling_config = ScalingConfig(num_workers=2, use_gpu=use_gpu, resources_per_worker={"CPU": 3})
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
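
For reference, the 2-GPUs-per-worker variant would look roughly like this (a sketch using the same imports, train_loop_per_worker, and train_dataset as above; the exact resource numbers are illustrative):

# Same imports as above: ray.train.ScalingConfig, ray.train.torch.TorchTrainer.
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"CPU": 3, "GPU": 2},  # request 2 GPUs per training worker
)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)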

The RayCluster you shared specifies num-gpus: "1" in the workerGroupSpecs[0].rayStartParams field. This is likely why the Ray dashboard only shows 2 GPUs (2 replicas with 1 GPU each).

In general, when using KubeRay you don't need to specify num-gpus at all: KubeRay will automatically set it to the correct value based on the container's GPU resource requests/limits (nvidia.com/gpu in your config).
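
For example, the worker group from your config could drop num-gpus and let KubeRay infer it from the nvidia.com/gpu requests/limits. This is a sketch trimmed to the relevant fields, not a complete manifest:

workerGroupSpecs:
- replicas: 2
  minReplicas: 1
  maxReplicas: 300
  groupName: small-group
  # No num-gpus here: KubeRay derives it from the nvidia.com/gpu resource below.
  rayStartParams: {}
  template:
    spec:
      containers:
      - name: machine-learning
        image: rayproject/ray-ml:2.12.0.c2a961-gpu
        resources:
          limits:
            cpu: "8"
            memory: "24G"
            nvidia.com/gpu: 4
          requests:
            cpu: "4"
            memory: "12G"
            nvidia.com/gpu: 4

With 2 replicas at 4 GPUs each, the dashboard should then report 8 GPUs for the cluster, and a job asking for more than 2 GPUs (e.g. 2 GPUs per training worker) should be able to schedule.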