Availability zones in Ray cluster configuration

Two related questions:

  • What happens when we don’t specify an availability zone in the cluster YAML configuration? Do nodes get launched across multiple availability zones in the region, or is there a default AZ in such scenarios? In this regard, the docs here are not very useful.

  • When I specify availability zones in a config like this,

provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2c,us-east-2b

do nodes get launched across availability zones or only in a single availability zone with the order of preference being us-east-2a > us-east-2c > us-east-2b?

It’ll try us-east-2a, and if the call fails for whatever reason it’ll try the next zone in the list, and so on. If it runs out of zones, the entire call just fails and no nodes are launched at all.

Let’s say 10 nodes are required and us-east-2a can satisfy only 5 of them. Now, if I am not wrong, it will try to launch the next 5 nodes in us-east-2b. Suppose 3 can be satisfied in 2b; then the remaining 2 will be launched in 2c, making the cluster span multiple AZs, right?

It depends on whether you turn on the multi-zone feature (see “Multi-zone compute configs user guide | Anyscale Docs”).
If the feature is off, it will try to launch all instances in the same zone. If the first instance launched in us-east-2a, then all follow-up instances are required to launch in us-east-2a.
If the feature is on, instances can launch in all the zones and across different zones.
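
To make the "feature on" case concrete, here is a toy model of the 10-node scenario from the question above. It is only a sketch under stated assumptions: `spread_across_zones` and the per-zone capacity numbers are hypothetical, and this is not Ray or Anyscale code.

```python
def spread_across_zones(zones, capacity, needed):
    """Toy model: fill zones greedily in preference order.

    `capacity` maps zone name -> number of nodes that zone can supply
    (made-up numbers; real capacity is not known up front).
    Returns (placement, unplaced).
    """
    placement = {}
    remaining = needed
    for zone in zones:
        if remaining == 0:
            break
        take = min(capacity.get(zone, 0), remaining)
        if take > 0:
            placement[zone] = take
            remaining -= take
    return placement, remaining


zones = ["us-east-2a", "us-east-2b", "us-east-2c"]
capacity = {"us-east-2a": 5, "us-east-2b": 3, "us-east-2c": 4}  # hypothetical
placement, unplaced = spread_across_zones(zones, capacity, needed=10)
print(placement, unplaced)
```

With these made-up capacities the cluster ends up spanning all three zones, which is exactly the cross-AZ layout that incurs data transfer charges.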

Is there some way to turn the feature on/off in the Ray cluster config file? Ideally, I want to avoid spreading nodes across zones, as it incurs data transfer charges and my training jobs are of low priority.

If you just configure it with a single explicit zone it will stick there; does that work for you?

That works, but a lot of the time a single zone does not have the required capacity. So, my use case is: 1. Nodes must be launched into a single zone. 2. If that zone does not have the required capacity, then all nodes in it must be killed and launched into the next zone.

Ah I see - you want the Ray Scheduler to respect affinity/stickiness to one region/zone… Let me get back to you on this…

Re-reading both the Ray and Anyscale documentation, I actually think the experience you’re looking for is already baked into the default settings.

Here’s what will happen if you specify multiple zones, let’s say Zone A and Zone B, in the Ray cluster config.

  • The Ray Autoscaler will attempt to spin up, let’s say, 5 machines in Zone A. If it can’t get all 5, it will give up and move on to Zone B.
  • In Zone B it will try again for all five, and if it is unable to get them it will fall out and return “Unschedulable”.
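
The two bullets above can be sketched as an all-or-nothing fallback over the zones listed in `availability_zone`. This is a minimal illustration of the behavior as described in this thread, not the autoscaler's actual implementation; `pick_single_zone` and the capacity numbers are hypothetical.

```python
def pick_single_zone(availability_zone, capacity, needed):
    """Return the first zone that can supply ALL `needed` nodes, else None.

    `availability_zone` is the comma-separated string from the cluster
    config; list order is treated as preference order.
    `capacity` maps zone -> available nodes (made-up for illustration).
    """
    zones = [z.strip() for z in availability_zone.split(",") if z.strip()]
    for zone in zones:
        if capacity.get(zone, 0) >= needed:
            return zone  # every node lands in this one zone
    return None  # no single zone fits: nothing is launched


az = "us-east-2a,us-east-2c,us-east-2b"
capacity = {"us-east-2a": 3, "us-east-2c": 5, "us-east-2b": 9}  # hypothetical
print(pick_single_zone(az, capacity, needed=5))
print(pick_single_zone(az, capacity, needed=10))
```

In this model a partial fit in one zone never triggers a top-up from another zone, so the cluster either sits entirely in one AZ or is not launched.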

TL;DR: a spread across multiple zones will not occur, because there’s no code path for such a capability in OSS Ray.

Thanks, that’s helpful, Sam!
