Hi there, thanks for the great work!
I have a task that runs quick inference of an ML model on the GPU. For this I set min_workers: 20
for my worker node type. After Ray initializes all the worker nodes, almost all of them stay idle (usually only 2-4 out of the 20 are not idle).
What could be the problem?
I also noticed that the head node is doing most of the work. How can I get proper load balancing across all my nodes?
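For context, this is roughly how I declare and submit the task. It is a minimal sketch with the real model code replaced by a placeholder, assuming one GPU is requested per task:

import ray

ray.init(address="auto")  # run from the head node, connects to the running cluster

@ray.remote(num_gpus=1)   # each task asks for one whole GPU
def infer(batch):
    # Placeholder for the real work: the actual code loads the model
    # and runs a forward pass on the GPU assigned to this task.
    return [x * 2 for x in batch]

batches = [[1, 2, 3]] * 100                           # placeholder input batches
results = ray.get([infer.remote(b) for b in batches])
print(len(results))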
Thanks!
Here is my config:
cluster_name: gpucluster
max_workers: 100
upscaling_speed: 2.0
idle_timeout_minutes: 10

docker:
  image: "rayproject/ray:latest-gpu"
  container_name: "ray_container"

provider:
  type: gcp
  region: ...
  availability_zone: ...
  project_id: ...

auth:
  ssh_user: ray

available_node_types:
  head_node:
    min_workers: 0
    max_workers: 0
    resources: {"CPU": 4, "GPU": 1}
    node_config:
      machineType: n1-highmem-4
      tags:
        - items: ["allow-all"]
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 100
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
      guestAccelerators:
        - acceleratorType: .../nvidia-tesla-p100
          acceleratorCount: 1
      metadata:
        items:
          - key: install-nvidia-driver
            value: "True"
      scheduling:
        - onHostMaintenance: "terminate"
        - automaticRestart: true
  worker_node:
    min_workers: 20
    resources: {"CPU": 4, "GPU": 1}
    node_config:
      machineType: n1-highmem-4
      tags:
        - items: ["allow-all"]
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 100
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
      guestAccelerators:
        - acceleratorType: .../nvidia-tesla-p100
          acceleratorCount: 1
      metadata:
        items:
          - key: install-nvidia-driver
            value: "True"
      scheduling:
        - preemptible: false
        - onHostMaintenance: "terminate"
        - automaticRestart: true

head_node_type: head_node