Using different worker images in a Ray cluster

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a cluster with two different types of worker nodes: one kind for CPU-only tasks such as featurization, and another kind for GPU tasks such as model training. Each kind of node also has its own Docker image. Here is a sample cluster config file:

cluster_name: ray-cluster

provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2c,us-east-2b

max_workers: 30

docker:
  # using featurization image as default image
  image: "<account-id>.dkr.ecr.us-east-2.amazonaws.com/feat-image"
  container_name: "ray-feat-worker"
  pull_before_run: True

available_node_types:
    ray_head_default:
        resources: {"CPU": 2}
        node_config:
            InstanceType: t3.medium
            IamInstanceProfile:
                Arn: "<iam-profile>"
            BlockDeviceMappings:
              - DeviceName: /dev/sda1
                Ebs:
                    VolumeSize: 150

    c5_cpu_16_spot:
        resources: {"CPU": 16, "GPU": 0}
        node_config:
            InstanceType: c5.4xlarge
            IamInstanceProfile:
                Arn: "<iam-profile>"
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: "0.3"
        min_workers: 0
        max_workers: 10

    single_gpu_spot:
        resources: {"CPU": 4, "GPU": 1}
        docker:
            # using a different image for training
            worker_image: "<account-id>.dkr.ecr.us-east-2.amazonaws.com/train-image"
        node_config:
            InstanceType: g4dn.xlarge
            IamInstanceProfile:
                Arn: "<iam-profile>"
            InstanceMarketOptions:
                MarketType: spot
        min_workers: 0
        max_workers: 10

head_node_type: ray_head_default

head_start_ray_commands:
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Issue: when I launch a training task (a job that uses a GPU), the job never starts, even though the nodes do get launched. A couple of questions: can we use different images for different node types? And if so, is my cluster config file correct?
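
For reference, the task is submitted roughly like this (simplified; train_model here is just a stand-in for the actual training code):

import ray

ray.init(address="auto")  # driver runs against the cluster

@ray.remote(num_gpus=1)   # request one GPU, so it should land on a single_gpu_spot node
def train_model():
    # stand-in for the real training code
    return ray.get_gpu_ids()

# The remote call is submitted and a GPU node comes up, but the task never starts.
print(ray.get(train_model.remote()))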

Hello arunppsg,
As far as I know, the per-node-type docker section with worker_image that you are using is supported, so yes, different node types can run different worker images. To address the issue of GPU tasks not launching despite the nodes being provisioned, you can work through these steps:

  1. Confirm Docker image compatibility: make sure the training image (train-image) is a GPU-enabled build with the CUDA libraries your framework needs, and that its Ray and Python versions match the default image and the head node; mismatched versions are a common reason workers fail to connect.
  2. Validate the cluster configuration: the per-node-type docker override itself looks right; run ray up with this config and watch the output while a single_gpu_spot node is being set up for image pull or setup errors.
  3. Check node accessibility and resources: once a GPU node is up, run ray status on the head node (or print ray.cluster_resources() from a driver) and confirm that a GPU resource is actually registered. If the node is running but no GPU shows up, the problem is likely inside the worker container (for example, the GPU devices are not visible to it) rather than in scheduling.
  4. Debug the task launching process: confirm that the training task or actor actually requests a GPU (num_gpus=1), and check the autoscaler logs (ray monitor <your-config>.yaml, or /tmp/ray/session_latest/logs/monitor.* on the head node) for pending-resource or node-setup errors.
  5. Test task launching with a minimal GPU task before running the full training job; a sketch follows this list.
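
A minimal probe along these lines, run from the head node (for example after ray attach <your-config>.yaml; gpu_probe is just an illustrative name), could look like:

import ray

ray.init(address="auto")  # connect to the running cluster from the head node

# If "GPU" is missing from this dict, no GPU resource was ever registered,
# i.e. the worker container is not exposing the GPU to Ray at all.
print(ray.cluster_resources())

@ray.remote(num_gpus=1)
def gpu_probe():
    import os
    import socket
    # Ray sets CUDA_VISIBLE_DEVICES for GPU tasks; an empty value points at a
    # container/driver problem rather than a scheduling problem.
    return socket.gethostname(), os.environ.get("CUDA_VISIBLE_DEVICES"), ray.get_gpu_ids()

# If this hangs with the task pending, no node with a free GPU ever joined the
# cluster, which is the same symptom as the training job.
print(ray.get(gpu_probe.remote()))

If gpu_probe succeeds but the real training job still does not start, compare the resources the job requests with what ray status reports.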

Working through these steps should narrow the issue down to the image itself, the container's access to the GPU, or the task's resource request, and let you resolve why GPU tasks are not launching on the cluster.

I hope this will help you.

Thanks.

cc @Jules_Damji (apologies for tagging)