Multiple available_node_types, some spot, some non-spot

xoffey · August 2, 2024, 9:00pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

I don’t have a good understanding of how available_node_types is supposed to work when it contains more than one node type.

I want to use spot instances as much as possible, but I also want the Ray cluster to use up to a certain number of non-spot instances as a fallback, in case the desired number of spot instances is not available. So I created two node types in the available_node_types list.

The problem is that the cluster manager never tries to launch any non-spot instances. Instead it retries launching spot instances, which typically does not succeed.

Here is a stripped-down snippet from my YAML file:

available_node_types:
    ray.head.default:
        # omitting this because it works fine
    ray.worker.nonspot_256:
        # big non-spot instances, in case we can't get enough spot instances
        min_workers: 0
        max_workers: 8  # limiting these because they're expensive
        resources: {"object_store_memory": 100000000}
        node_config:
            InstanceType: x2iedn.2xlarge  # 256 GB ram
            # omitting aws details which should not be relevant
    ray.worker.spot_256:
        min_workers: 0
        max_workers: 14  # usually can't get more than this many x2iedn.2xlarge instances
        resources: {"object_store_memory": 100000000}
        node_config:
            InstanceType: x2iedn.2xlarge  # 256 GB ram
            InstanceMarketOptions:
                MarketType: spot

Sam_Chan · August 5, 2024, 6:50am

We actually have this as part of the hosted Ray solution on Anyscale (see here for docs on how that works).

xoffey · August 5, 2024, 4:43pm

Does Anyscale cost money? I was looking for a way to do it with Ray Clusters.

Sam_Chan · August 6, 2024, 7:57am

There isn’t currently a way to do this explicitly on Ray Clusters; can you please create a feature request on GitHub?

lobanov · August 6, 2024, 2:01pm

You can do something like that with KubeRay if your Kubernetes cluster is appropriately configured. For that to work you need to 1) configure different worker groups in the RayCluster resource, using different node selectors for their pods, and 2) have your Kubernetes cluster able to spawn different types of nodes based on the pods’ node selectors. For example, if you use AWS EKS with Karpenter, you can use the karpenter.sh/capacity-type: spot node selector on the worker pods to get spot instances added to your cluster (see this blog for details). Similar approaches are available on other cloud providers.
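
As a rough illustration of the two steps above, a RayCluster manifest with one spot worker group and one on-demand worker group might look like the sketch below. The cluster name, group names, image tag, and replica limits are assumptions chosen for illustration (the limits mirror the 14 and 8 from the original post), and the node selectors only take effect if Karpenter, or an equivalent node provisioner, is set up to honor them.

```yaml
# Sketch only: names, image tag, and replica limits are illustrative assumptions.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: spot-with-fallback            # hypothetical cluster name
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          karpenter.sh/capacity-type: on-demand   # keep the head off spot capacity
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0           # assumed image tag
  workerGroupSpecs:
    - groupName: spot-workers                     # hypothetical group name
      replicas: 0
      minReplicas: 0
      maxReplicas: 14
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            karpenter.sh/capacity-type: spot      # pods land on spot nodes
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
    - groupName: on-demand-workers                # hypothetical group name
      replicas: 0
      minReplicas: 0
      maxReplicas: 8
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            karpenter.sh/capacity-type: on-demand # pods land on on-demand nodes
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
```

This only controls where each group's pods are scheduled. Which group gets scaled up first, and whether pending spot pods ever lead to on-demand pods being requested, still depends on the autoscaler, so treat this as a starting point rather than a guaranteed fallback.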

raphael · October 8, 2025, 1:42pm

Is this feature available now? Thank you.

RunLLM · October 8, 2025, 1:43pm

Could you clarify which specific feature you are referring to? Your question does not specify a feature, and Ray has many components and ongoing updates. Please provide more details or context (e.g., a feature name, API, or link to a previous discussion) so I can give you an accurate answer with the latest information and sources.

Hint: Mention @RunLLM in the post for followups.

raphael · October 8, 2025, 2:00pm

@RunLLM The feature mentioned in the original post: being able to fall back to on-demand instances when spot instances are unavailable or, more generally, being able to prioritize some instance types.

RunLLM · October 8, 2025, 2:01pm

Yes, this feature is now available on Anyscale. You can configure worker nodes to prefer spot instances but fall back to on-demand, and also set custom prioritization rules for worker groups (e.g., prefer reserved, then spot, then on-demand). This is done using the instance_ranking_strategy and related flags in your compute config. Price-based ranking and custom group order are both supported, allowing you to control instance selection and fallback behavior in detail. See the official docs for configuration examples and details on these strategies.

For open source Ray (not Anyscale), native fallback from spot to on-demand is not fully supported; the autoscaler will keep retrying the same node type if spot capacity is unavailable, rather than automatically falling back to on-demand. This limitation is discussed in community threads and GitHub issues, and is a key differentiator between Anyscale and open source Ray in this area. Would you like more detail or example config snippets?

raphael · October 8, 2025, 2:15pm

@RunLLM If I use Anyscale with my AWS cloud, how much will it cost? I use the following instance types: g4dn.xlarge and m5.large.

RunLLM · October 8, 2025, 2:16pm

Your total cost when using Anyscale with AWS depends on several factors: Anyscale platform fees, the AWS on-demand or spot pricing for your chosen instance types (g4dn.xlarge and m5.large), and any additional AWS resources (e.g., storage, load balancers, or a managed Redis service such as MemoryDB for head node fault tolerance). Anyscale charges based on the type and duration of compute resources used, and AWS charges separately for the underlying infrastructure. You can view detailed usage and cost breakdowns in the Anyscale usage dashboard, but exact pricing is not specified in the documentation and may require checking both the Anyscale pricing page and the AWS pricing calculator.

I’m not entirely certain, but you may also incur costs for AWS services such as MemoryDB or load balancers if you enable certain Anyscale features. For precise estimates, consult both the Anyscale and AWS pricing resources, and use the Anyscale usage dashboard to track actual spend. Would you like more detail on how to estimate or monitor these costs?

Sources:

https://docs.anyscale.com/administration/resource-management/head-node-fault-tolerance
https://docs.anyscale.com/services#permissions
Note: I wasn’t able to find highly relevant data sources, but above are a few potentially relevant links.
