How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I have a cluster with two types of worker nodes: one for CPU-only tasks such as featurization, and another for GPU tasks such as model training. Each node type has its own Docker image. Here is a sample cluster config file:
cluster_name: ray-cluster
provider:
  type: aws
  region: us-east-2
  availability_zone: us-east-2a,us-east-2c,us-east-2b
max_workers: 30
docker:
  # using featurization image as default image
  image: "<account-id>.dkr.ecr.us-east-2.amazonaws.com/feat-image"
  container_name: "ray-feat-worker"
  pull_before_run: True
available_node_types:
  ray_head_default:
    resources: {"CPU": 2}
    node_config:
      InstanceType: t3.medium
      IamInstanceProfile:
        Arn: "<iam-profile>"
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 150
  c5_cpu_16_spot:
    resources: {"CPU": 16, "GPU": 0}
    node_config:
      InstanceType: c5.4xlarge
      IamInstanceProfile:
        Arn: "<iam-profile>"
      InstanceMarketOptions:
        MarketType: spot
        SpotOptions:
          MaxPrice: "0.3"
    min_workers: 0
    max_workers: 10
  single_gpu_spot:
    resources: {"CPU": 4, "GPU": 1}
    docker:
      # using a different image for training
      worker_image: "<account-id>.dkr.ecr.us-east-2.amazonaws.com/train-image"
    node_config:
      InstanceType: g4dn.xlarge
      IamInstanceProfile:
        Arn: "<iam-profile>"
      InstanceMarketOptions:
        MarketType: spot
    min_workers: 0
    max_workers: 10
head_node_type: ray_head_default
head_start_ray_commands:
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
worker_start_ray_commands:
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
Issue: when I launch a training task (a job that uses a GPU), the nodes are launched, but the job itself never starts. Two questions: Can different node types use different images? If so, is my cluster config file correct?
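To make my expectation concrete, here is a minimal Python sketch of how I assume the image should be resolved for each node type. The helper names `effective_docker_config` and `worker_image_for` are hypothetical and are not Ray APIs. The merge rule (the node type's `docker` section is shallow-merged over the top-level one, and `worker_image` takes precedence over `image` on workers) is my assumption about the intended behavior, not something I have confirmed.

```python
import copy


def effective_docker_config(cluster_config, node_type):
    """Shallow-merge a node type's docker section over the top-level one.

    Assumption: this mirrors how per-node-type overrides are meant to work.
    """
    merged = copy.deepcopy(cluster_config.get("docker", {}))
    node_cfg = cluster_config.get("available_node_types", {}).get(node_type, {})
    merged.update(node_cfg.get("docker", {}))
    return merged


def worker_image_for(cluster_config, node_type):
    """Pick the image a worker of this type would run (assumed precedence)."""
    docker = effective_docker_config(cluster_config, node_type)
    return docker.get("worker_image", docker.get("image"))


# Trimmed-down version of the config above, keeping only the docker parts.
REGISTRY = "<account-id>.dkr.ecr.us-east-2.amazonaws.com"
config = {
    "docker": {
        "image": f"{REGISTRY}/feat-image",
        "container_name": "ray-feat-worker",
        "pull_before_run": True,
    },
    "available_node_types": {
        "c5_cpu_16_spot": {"resources": {"CPU": 16, "GPU": 0}},
        "single_gpu_spot": {
            "resources": {"CPU": 4, "GPU": 1},
            "docker": {"worker_image": f"{REGISTRY}/train-image"},
        },
    },
}

cpu_image = worker_image_for(config, "c5_cpu_16_spot")
gpu_image = worker_image_for(config, "single_gpu_spot")
print("CPU workers:", cpu_image)
print("GPU workers:", gpu_image)
```

Under these assumptions, CPU workers would fall back to the featurization image and GPU workers would pick up the training image, while both keep the top-level `container_name` and `pull_before_run` settings.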