Availability zones in ray cluster configuration

arunppsg · May 21, 2024, 5:17am

Two related questions:

1. What happens when we don’t specify an availability zone in the cluster YAML configuration? Do nodes get launched across multiple availability zones in the region, or is there a default AZ in such scenarios? The docs are not very helpful on this point.
2. When I specify availability zones in a config like this,

provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2c,us-east-2b

do nodes get launched across availability zones or only in a single availability zone with the order of preference being us-east-2a > us-east-2c > us-east-2b?

Sam_Chan · May 22, 2024, 10:28pm

It’ll try us-east-2a first, and if the call fails for whatever reason it’ll try the next zone in the list, and so on. If it runs out of zones, the entire call fails and no nodes are launched at all.

arunppsg · May 23, 2024, 3:48am

Let’s say 10 nodes are required and us-east-2a can satisfy 5 of them. If I am not wrong, it will then try to launch the remaining 5 nodes in us-east-2b. Suppose 2b can satisfy 3; then the last 2 will be launched in 2c, making the cluster span multiple AZs, right?

Bruce_Zhang · June 13, 2024, 10:14pm

It depends on whether you turn on the multi-zone feature (see the “Multi-zone compute configs user guide” in the Anyscale Docs).
If the feature is off, it will try to launch all instances in the same zone. If the first instance launches in us-east-2a, then all follow-up instances are required to launch in us-east-2a.
If the feature is on, instances can launch in any of the zones, and the cluster can span different zones.
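
To make the two modes concrete, here is a toy Python model. It is not Ray or Anyscale code: `place_nodes` is a hypothetical helper, and the per-zone capacities are the made-up numbers from the 10-node example above. It only illustrates the difference between pinning to the first zone that launches anything and spilling over into later zones.

```python
def place_nodes(needed, capacity, multi_zone):
    """Toy model: walk zones in dict order and return (placement, unplaced).

    capacity maps zone name -> number of nodes that zone can currently supply.
    """
    placement = {}
    remaining = needed
    for zone, free in capacity.items():
        if remaining == 0:
            break
        launched = min(free, remaining)
        if launched == 0:
            continue  # nothing available here, look at the next zone
        placement[zone] = launched
        remaining -= launched
        if not multi_zone:
            # Feature off: the first zone that launches anything pins the cluster.
            break
    return placement, remaining


# Hypothetical capacities matching the 10-node example.
capacity = {"us-east-2a": 5, "us-east-2b": 3, "us-east-2c": 2}

print(place_nodes(10, capacity, multi_zone=True))
# ({'us-east-2a': 5, 'us-east-2b': 3, 'us-east-2c': 2}, 0)
print(place_nodes(10, capacity, multi_zone=False))
# ({'us-east-2a': 5}, 5)
```

With the feature off, the 5 unplaced nodes stay pending rather than moving to another zone.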

arunppsg · June 14, 2024, 6:20am

Is there some way to turn the feature on or off in the Ray cluster config file? Ideally, I want to avoid spreading nodes across zones, as that incurs data transfer charges and my training jobs are low priority.

Sam_Chan · June 14, 2024, 7:52pm

If you just configure it with a single explicit zone it will stick there; does that work for you?
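
As a rough sketch of how the single-zone approach could be combined with a list of preferred zones, the hypothetical helper below splits a `provider` section like the one in the original question into one single-zone variant per zone, in the order given. The name `single_zone_providers` is made up for illustration. Writing each variant back into a copy of the cluster YAML and launching them one at a time is an assumed workflow, not something Ray does for you.

```python
import copy


def single_zone_providers(provider):
    """Return one provider dict per zone listed in a comma-separated availability_zone."""
    zones = [z.strip() for z in provider["availability_zone"].split(",") if z.strip()]
    variants = []
    for zone in zones:
        variant = copy.deepcopy(provider)  # leave the original config untouched
        variant["availability_zone"] = zone
        variants.append(variant)
    return variants


# The provider section from the original question, as a Python dict.
provider = {
    "type": "aws",
    "region": "us-east-2",
    "availability_zone": "us-east-2a,us-east-2c,us-east-2b",
}

for variant in single_zone_providers(provider):
    print(variant["availability_zone"])
```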

arunppsg · June 17, 2024, 9:10am

That works, but a lot of the time a single zone does not have the required capacity. So, my use case is: 1. All nodes must be launched into a single zone. 2. If that zone does not have the required capacity, then all nodes in it must be killed and launched into the next zone.

Sam_Chan · June 17, 2024, 6:21pm

Ah I see - you want the Ray Scheduler to respect affinity/stickiness to one region/zone… Let me get back to you on this…

Sam_Chan · June 24, 2024, 10:50pm

Re-reading both the Ray and Anyscale documentation, I actually think the experience you’re looking for is already baked into the default settings.

Here’s what will happen if you specify multiple zones, let’s say Zone A and Zone B, in the Ray Cluster config.

The Ray Autoscaler will attempt to spin up, let’s say, 5 machines in Zone A. If it can’t get all 5, it will give up on Zone A and move to Zone B.
In Zone B it will try again for all 5, and if it is unable to get them it will bail out and return “Unschedulable”.

TL;DR: a spread across multiple zones will not occur, because there’s no code path for such a capability in OSS Ray.
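
The all-or-nothing fallback described above can be sketched in a few lines of Python. This is only a model of the behavior as described in this reply, not the autoscaler’s actual code: `choose_zone`, the zone names, and the capacity numbers are all hypothetical.

```python
def choose_zone(needed, capacity, zones):
    """Return the first zone (in preference order) that can hold ALL needed nodes.

    Returns None when no single zone can satisfy the whole request,
    which corresponds to the "Unschedulable" outcome.
    """
    for zone in zones:
        if capacity.get(zone, 0) >= needed:
            return zone
    return None


zones = ["zone-a", "zone-b"]

# Zone A can only supply 3 of the 5 machines, so the whole request moves to Zone B.
print(choose_zone(5, {"zone-a": 3, "zone-b": 5}, zones))  # zone-b

# Neither zone can supply all 5, so nothing is placed.
print(choose_zone(5, {"zone-a": 3, "zone-b": 4}, zones))  # None
```

Note that the request is never split: 3 machines in Zone A plus 2 in Zone B is not an outcome this model can produce.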

arunppsg · June 25, 2024, 5:15pm

Thanks, that’s helpful, Sam!
