Using different worker images in a Ray cluster

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a cluster with two different types of worker nodes: one kind for CPU-only tasks such as featurization, and another kind for GPU tasks such as model training. Each kind of node also has its own Docker image. Here is a sample cluster config file:

cluster_name: ray-cluster

provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2c,us-east-2b

max_workers: 30

docker:
  # using featurization image as default image
  image: "<account-id>.dkr.ecr.us-east-2.amazonaws.com/feat-image"
  container_name: "ray-feat-worker"
  pull_before_run: True

available_node_types:
    ray_head_default:
        resources: {"CPU": 2}
        node_config:
            InstanceType: t3.medium
            IamInstanceProfile:
                Arn: "<iam-profile>"
            BlockDeviceMappings:
              - DeviceName: /dev/sda1
                Ebs:
                    VolumeSize: 150

    c5_cpu_16_spot:
        resources: {"CPU": 16, "GPU": 0}
        node_config:
            InstanceType: c5.4xlarge
            IamInstanceProfile:
                Arn: "<iam-profile>"
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: "0.3"
        min_workers: 0
        max_workers: 10

    single_gpu_spot:
        resources: {"CPU": 4, "GPU": 1}
        docker:
            # using a different image for training
            worker_image: "<account-id>.dkr.ecr.us-east-2.amazonaws.com/train-image"
        node_config:
            InstanceType: g4dn.xlarge
            IamInstanceProfile:
                Arn: "<iam-profile>"
            InstanceMarketOptions:
                MarketType: spot
        min_workers: 0
        max_workers: 10

head_node_type: ray_head_default

head_start_ray_commands:
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Issue: when I launch a training task (a job that uses a GPU), the job never starts, even though the nodes do get launched. A couple of questions: can we use different images for different node types? And if so, is my cluster config file correct?
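
For reference, the task is submitted roughly like this (simplified; train_model here is just a stand-in for the actual training code):

import ray

ray.init(address="auto")  # driver runs against the cluster

@ray.remote(num_gpus=1)   # request one GPU, so it should land on a single_gpu_spot node
def train_model():
    # stand-in for the real training code
    return ray.get_gpu_ids()

# The remote call is submitted and a GPU node comes up, but the task never starts.
print(ray.get(train_model.remote()))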

Hello arunppsg,
As far as I know, the per-node-type docker section with worker_image that you are using is supported, so yes, different node types can run different worker images. To address the issue of GPU tasks not launching despite the nodes being provisioned, you can work through these steps:

  1. Confirm Docker image compatibility: make sure the training image (train-image) is a GPU-enabled build with the CUDA libraries your framework needs, and that its Ray and Python versions match the default image and the head node; mismatched versions are a common reason workers fail to connect.
  2. Validate the cluster configuration: the per-node-type docker override itself looks right; run ray up with this config and watch the output while a single_gpu_spot node is being set up for image pull or setup errors.
  3. Check node accessibility and resources: once a GPU node is up, run ray status on the head node (or print ray.cluster_resources() from a driver) and confirm that a GPU resource is actually registered. If the node is running but no GPU shows up, the problem is likely inside the worker container (for example, the GPU devices are not visible to it) rather than in scheduling.
  4. Debug the task launching process: confirm that the training task or actor actually requests a GPU (num_gpus=1), and check the autoscaler logs (ray monitor <your-config>.yaml, or /tmp/ray/session_latest/logs/monitor.* on the head node) for pending-resource or node-setup errors.
  5. Test task launching with a minimal GPU task before running the full training job; a sketch follows this list.
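
A minimal probe along these lines, run from the head node (for example after ray attach <your-config>.yaml; gpu_probe is just an illustrative name), could look like:

import ray

ray.init(address="auto")  # connect to the running cluster from the head node

# If "GPU" is missing from this dict, no GPU resource was ever registered,
# i.e. the worker container is not exposing the GPU to Ray at all.
print(ray.cluster_resources())

@ray.remote(num_gpus=1)
def gpu_probe():
    import os
    import socket
    # Ray sets CUDA_VISIBLE_DEVICES for GPU tasks; an empty value points at a
    # container/driver problem rather than a scheduling problem.
    return socket.gethostname(), os.environ.get("CUDA_VISIBLE_DEVICES"), ray.get_gpu_ids()

# If this hangs with the task pending, no node with a free GPU ever joined the
# cluster, which is the same symptom as the training job.
print(ray.get(gpu_probe.remote()))

If gpu_probe succeeds but the real training job still does not start, compare the resources the job requests with what ray status reports.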

Working through these steps should narrow the issue down to the image itself, the container's access to the GPU, or the task's resource request, and let you resolve why GPU tasks are not launching on the cluster.

I hope this will help you.

Thanks.

cc @Jules_Damji (apologies for tagging)