How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I don’t have a good understanding of how available_node_types is supposed to work when it contains more than one worker node type.
I want to use spot instances as much as possible, but have the Ray cluster fall back to a limited number of non-spot instances when the desired number of spot instances is not available. So I created two worker node types in the available_node_types list.
The problem is that the cluster manager never tries to launch any non-spot instances. Instead, it keeps retrying to launch spot instances, which typically does not succeed.
Here is a stripped-down snippet from my YAML file:
available_node_types:
  ray.head.default:
    # omitting this because it works fine
  ray.worker.nonspot_256:
    # big non-spot instances, in case we can't get enough spot instances
    min_workers: 0
    max_workers: 8 # limiting these because they're expensive
    resources: {"object_store_memory": 100000000}
    node_config:
      InstanceType: x2iedn.2xlarge # 256 GB ram
      # omitting aws details which should not be relevant
  ray.worker.spot_256:
    min_workers: 0
    max_workers: 14 # usually can't get more than this many x2iedn.2xlarge instances
    resources: {"object_store_memory": 100000000}
    node_config:
      InstanceType: x2iedn.2xlarge # 256 GB ram
      InstanceMarketOptions:
        MarketType: spot
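
To make the intent of the config above concrete, here is a minimal sketch that totals the worker capacity I expect per market type. `capacity_by_market` is a hypothetical helper of my own, not a Ray API, and the dict only mirrors the two worker node types from the snippet. It assumes that a node type without `InstanceMarketOptions.MarketType` is on-demand, which is the EC2 default. It says nothing about how the autoscaler actually chooses between the two types.

```python
# Hypothetical helper, not part of Ray: total max_workers per market type.
# The dict mirrors the two worker node types from the YAML snippet above.
config = {
    "available_node_types": {
        "ray.worker.nonspot_256": {
            "min_workers": 0,
            "max_workers": 8,
            "node_config": {"InstanceType": "x2iedn.2xlarge"},
        },
        "ray.worker.spot_256": {
            "min_workers": 0,
            "max_workers": 14,
            "node_config": {
                "InstanceType": "x2iedn.2xlarge",
                "InstanceMarketOptions": {"MarketType": "spot"},
            },
        },
    }
}


def capacity_by_market(cfg):
    """Return {market_type: sum of max_workers} over all node types."""
    totals = {}
    for node_type in cfg["available_node_types"].values():
        # Assumption: no InstanceMarketOptions means an on-demand instance.
        market = (
            node_type.get("node_config", {})
            .get("InstanceMarketOptions", {})
            .get("MarketType", "on-demand")
        )
        totals[market] = totals.get(market, 0) + node_type.get("max_workers", 0)
    return totals


summary = capacity_by_market(config)
print(summary)
```

In other words, what I am hoping for is up to 14 spot workers, with up to 8 on-demand workers available as a fallback.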