How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I don’t have a good understanding of how available_node_types is supposed to work when you have more than one.
I want to use spot instances as much as possible, but I want the Ray cluster to fall back to up to a certain number of non-spot (on-demand) instances when the desired number of spot instances is not available. So I created 2 node types in the available_node_types list.
The problem is that the cluster manager never tries to launch any non-spot instances. Instead, it keeps retrying the spot instances, which typically does not succeed.
Here is a stripped-down snippet from my YAML file:
available_node_types:
  ray.head.default:
    # omitting this because it works fine
  ray.worker.nonspot_256:
    # big non-spot instances, in case we can't get enough spot instances
    min_workers: 0
    max_workers: 8  # limiting these because they're expensive
    resources: {"object_store_memory": 100000000}
    node_config:
      InstanceType: x2iedn.2xlarge  # 256 GB RAM
      # omitting AWS details which should not be relevant
  ray.worker.spot_256:
    min_workers: 0
    max_workers: 14  # usually can't get more than this many x2iedn.2xlarge instances
    resources: {"object_store_memory": 100000000}
    node_config:
      InstanceType: x2iedn.2xlarge  # 256 GB RAM
      InstanceMarketOptions:
        MarketType: spot
You can do something like that with KubeRay if your Kubernetes cluster is appropriately configured. For that to work you need to 1) configure different worker groups in the RayCluster resource, using different node selectors for their pods, and 2) make sure your Kubernetes cluster can provision different types of nodes based on the pods’ node selectors. For example, if you use AWS EKS with Karpenter, you can put the karpenter.sh/capacity-type: spot node selector on the worker pods to get spot instances added to your cluster (see this blog for details; a sketch follows below). Similar approaches are available on other cloud providers.
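To make point 1 concrete, here is a minimal sketch of the worker groups in a RayCluster manifest, assuming an EKS cluster with Karpenter installed; the group names, replica counts, and image are placeholders, and everything except the node selectors is trimmed down:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: spot-with-fallback
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:latest
  workerGroupSpecs:
    # Preferred group: pods request spot capacity via the node selector.
    - groupName: spot-workers
      minReplicas: 0
      maxReplicas: 14
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            karpenter.sh/capacity-type: spot
          containers:
            - name: ray-worker
              image: rayproject/ray:latest
    # Fallback group: pods land on on-demand capacity.
    - groupName: ondemand-workers
      minReplicas: 0
      maxReplicas: 8
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            karpenter.sh/capacity-type: on-demand
          containers:
            - name: ray-worker
              image: rayproject/ray:latest

The spot and on-demand values are Karpenter’s well-known capacity-type labels; how much automatic fallback you actually get depends on how your Karpenter provisioning is configured.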
Could you clarify which specific feature you are referring to? Your question does not specify a feature, and Ray has many components and ongoing updates. Please provide more details or context (e.g., a feature name, API, or link to a previous discussion) so I can give you an accurate answer with the latest information and sources.
@RunLLM The feature mentioned in the original post: being able to fall back to on-demand instances when spot instances are unavailable, or more generally, being able to prioritize some instance types.
Yes, this feature is now available on Anyscale. You can configure worker nodes to prefer spot instances but fall back to on-demand, and also set custom prioritization rules for worker groups (e.g., prefer reserved, then spot, then on-demand). This is done using the instance_ranking_strategy and related flags in your compute config. Price-based ranking and custom group order are both supported, allowing you to control instance selection and fallback behavior in detail. See the official docs for configuration examples and details on these strategies.
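As a rough, non-authoritative sketch of what such a compute config could look like (the market_type / PREFER_SPOT field and the exact flags layout are assumptions on my part, the custom_group_order value refers to the custom group ordering mentioned above, and the head node type is a placeholder; check the Anyscale compute config docs for the exact schema):

# Hypothetical Anyscale compute config excerpt; field names are illustrative.
head_node:
  instance_type: m5.2xlarge            # placeholder head node type
worker_nodes:
  - instance_type: x2iedn.2xlarge      # instance type from the original post
    min_nodes: 0
    max_nodes: 14
    market_type: PREFER_SPOT           # try spot first, fall back to on-demand
flags:
  # Ranking strategy controlling which worker groups are tried first
  # (e.g., reserved, then spot, then on-demand).
  instance_ranking_strategy: custom_group_order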
For open source Ray (not Anyscale), native fallback from spot to on-demand is not fully supported; the autoscaler will keep retrying the same node type if spot capacity is unavailable, rather than automatically falling back to on-demand. This limitation is discussed in community threads and GitHub issues, and is a key differentiator between Anyscale and open source Ray in this area. Would you like more detail or example config snippets?
Your total cost when using Anyscale with AWS depends on several factors: Anyscale platform fees, the AWS on-demand or spot pricing for your chosen instance types (g4dn.xlarge and m5.large), and any additional AWS resources (e.g., storage, load balancers, or a managed Redis such as MemoryDB for head node fault tolerance). Anyscale charges based on the type and duration of compute resources used, and AWS charges separately for the underlying infrastructure. You can view detailed usage and cost breakdowns in the Anyscale usage dashboard, but exact pricing is not specified in the documentation, so you may need to check both the Anyscale pricing page and the AWS pricing calculator.
You may also incur costs for required AWS services such as MemoryDB or load balancers if you enable certain Anyscale features. For precise estimates, consult both Anyscale and AWS pricing resources, and use the Anyscale usage dashboard for real-time tracking. Would you like more detail on how to estimate or monitor these costs?