Multiple available_node_types, some spot, some non-spot

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I don’t have a good understanding of how available_node_types is supposed to work when it contains more than one worker node type.

I want to use spot instances as much as possible, but have the Ray cluster fall back to a limited number of non-spot instances if the desired number of spot instances is not available. So I created two worker node types in the available_node_types list.

The problem is that the cluster manager never tries to launch any non-spot instances. Instead it retries launching spot instances, which typically does not succeed.

Here is a stripped-down snippet from my YAML file:

available_node_types:
    ray.head.default:
        # omitting this because it works fine
    ray.worker.nonspot_256:
        # big non-spot instances, in case we can't get enough spot instances
        min_workers: 0
        max_workers: 8  # limiting these because they're expensive
        resources: {"object_store_memory": 100000000}
        node_config:
            InstanceType: x2iedn.2xlarge  # 256 GB ram
            # omitting aws details which should not be relevant
    ray.worker.spot_256:
        min_workers: 0
        max_workers: 14  # usually can't get more than this many x2iedn.2xlarge instances
        resources: {"object_store_memory": 100000000}
        node_config:
            InstanceType: x2iedn.2xlarge  # 256 GB ram
            InstanceMarketOptions:
                MarketType: spot

We actually have this as part of the hosted Ray solution on Anyscale (see here for docs on how that works).

Does Anyscale cost money? I was looking for a way to do it with a Ray cluster.

There isn’t currently a way to do this explicitly on Ray Clusters; can you please create a feature request on GitHub?

You can do something like that with KubeRay if your Kubernetes cluster is appropriately configured. For that to work, you need to 1) configure different worker groups in the RayCluster resource, using different node selectors for their pods, and 2) make sure your Kubernetes cluster can provision different types of nodes based on the pods’ node selectors. For example, if you use AWS EKS with Karpenter, you can put the karpenter.sh/capacity-type: spot node selector on the worker pods to get spot instances added to your cluster (see this blog for details). Similar approaches are available on other cloud providers.
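
As a rough sketch of point 1), a RayCluster resource with one spot worker group and one on-demand worker group could look something like the fragment below. The cluster name, group names, container image tag, and replica limits are illustrative assumptions (the limits just mirror the 14 and 8 from the original post), and it assumes Karpenter is already installed and allowed to provision both capacity types. It only shows how to separate the groups by capacity type; it does not by itself make the autoscaler try spot first and fall back to on-demand.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: spot-with-fallback          # hypothetical name
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0   # illustrative image tag
  workerGroupSpecs:
    # Worker pods in this group can only be scheduled on spot nodes.
    - groupName: spot-workers       # hypothetical group name
      replicas: 0
      minReplicas: 0
      maxReplicas: 14
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            karpenter.sh/capacity-type: spot
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
    # Worker pods in this group can only be scheduled on on-demand nodes.
    - groupName: on-demand-workers  # hypothetical group name
      replicas: 0
      minReplicas: 0
      maxReplicas: 8                # kept low because these are expensive
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            karpenter.sh/capacity-type: on-demand
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
```

Resource requests and limits, the instance-type selector, and the autoscaler settings are omitted here and would need to be filled in for a real cluster.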