Azure access to launch ray auto scaling clusters

I am trying to launch an autoscaling cluster in Azure using the YAML file provided in the Ray documentation, and I am running into multiple scope issues. I would like to understand which permissions I need at both the resource group and subscription level to run the cluster deployment successfully. @bill-anyscale

Hey Prasad,

thanks for reaching out. What’s the error that you are getting? Here is the code that seems to be grabbing the information for what to launch:

from msrestazure.azure_active_directory import MSIAuthentication
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import DeploymentMode

I assume that Ray needs permissions along those dimensions. If you can provide the exception, it might be easier to see which one you are missing.

@bill-anyscale - I get the following permission error: Microsoft.Authorization/roleAssignments/write…
This was the first error I got. Then, when I upgraded my permission from contributor to owner of the resource group, it said it was unable to update tenant ID, subscription ID, scope, principal ID, etc.

Why is it trying to change or update those things? The only way I could run the script was with owner permission on both the subscription and the resource group, which, as you can understand, is not possible in a corporate setup. Can you please advise on the specific read/write access I would need to run the deployment script?

@gramhagen would you be able to help out here?

In order to use the Azure autoscaler with Ray, the head node needs to be able to create and destroy VMs. To support that, we create a service principal with contributor access to the resource group. This means you will need authorization to create user-assigned identities and provision role assignments for that identity. You should not need to be owner of both the subscription and resource group; typically contributor is enough.

There are additional resources created, such as a VNet, a network security group, and the resource group itself, which will likely require contributor access for the subscription.

We have not yet implemented a way to create those resources up front and pass information (like an Azure resource ID) into the config, which would skip creation. That would let you restrict individuals using Ray to contributor access for a specific resource group only. It may be worth adding a feature request in the GitHub repo if you think that is valuable.

You could also look into using the Deploy to Azure method described here Launching Cloud Clusters — Ray v1.6.0
You still need contributor access to deploy it, but once set up you should be able to restrict access to the resource group. Note this method uses Virtual Machine Scale Sets to manage scaling the cluster up and down.

@gramhagen @rliaw - I actually have owner access to the resource group, but I still get this error.


Why is the program trying to update the tenant ID, application ID, etc.? Please advise what I am missing from an access perspective.

I am using the YAML deployment script - ray/example-full.yaml at master · ray-project/ray · GitHub

To create VMs on the user’s behalf we use a user-assigned managed identity, which must have a contributor role assignment for the resource group where the Ray cluster is deployed. The error message you see indicates that you do not have authorization to assign a role to that identity.
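As a rough illustration of what such a role assignment contains, here is a minimal Python sketch. It only builds the data and calls no Azure API. The function names, dict layout, and placeholder IDs are hypothetical; the scope format and the built-in Contributor role GUID are standard Azure values. A role assignment ties a principal ID to a role definition at a scope, which is probably why the error mentions those fields.

```python
import uuid

# GUID of Azure's built-in Contributor role (the same in every tenant).
CONTRIBUTOR_ROLE_ID = "b24988ac-6180-42a0-ab88-20f7382dd24c"


def resource_group_scope(subscription_id: str, resource_group: str) -> str:
    """Return the ARM scope string for a resource group."""
    return f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"


def contributor_assignment(subscription_id: str, resource_group: str,
                           principal_id: str) -> dict:
    """Describe a Contributor role assignment for one identity.

    Hypothetical helper for illustration only; it does not create anything.
    """
    role_definition_id = (
        f"/subscriptions/{subscription_id}/providers/"
        f"Microsoft.Authorization/roleDefinitions/{CONTRIBUTOR_ROLE_ID}"
    )
    return {
        # Role assignment names are GUIDs.
        "name": str(uuid.uuid4()),
        "scope": resource_group_scope(subscription_id, resource_group),
        "properties": {
            "roleDefinitionId": role_definition_id,
            "principalId": principal_id,
        },
    }


if __name__ == "__main__":
    # Placeholder values, not real IDs.
    print(contributor_assignment("my-subscription-id", "my-ray-rg",
                                 "identity-principal-id"))
```

Creating a resource like this requires the Microsoft.Authorization/roleAssignments/write permission at that scope, which matches the first error reported above.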

Thanks for the reply… one last question… @gramhagen

to this point - “the error message you see is indicating you do not have authorization to assign a role to that identity”

Can you tell me, in Microsoft terms, which role assignment access I would need?

If you have the owner role you will most likely have write permission for role assignment creation. Another problem that can happen: if the identity is deleted but its role assignment is not, then deploying again to the same resource group will fail when it tries to create an identical role assignment. You might want to try using a fresh resource group.
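To make the owner point concrete, here is a simplified sketch of how Azure's built-in roles treat the role assignment write action. The role definitions are abbreviated from the built-in Owner, Contributor, and User Access Administrator roles (the real Contributor role has more NotActions), and the matching logic is an approximation rather than Azure's actual evaluator, so check the current Azure documentation before relying on it.

```python
from fnmatch import fnmatchcase

# Abbreviated built-in role definitions (assumption: simplified for illustration).
ROLES = {
    "Owner": {"actions": ["*"], "not_actions": []},
    "Contributor": {
        "actions": ["*"],
        "not_actions": [
            "Microsoft.Authorization/*/Delete",
            "Microsoft.Authorization/*/Write",
            "Microsoft.Authorization/elevateAccess/Action",
        ],
    },
    "User Access Administrator": {
        "actions": ["*/read", "Microsoft.Authorization/*", "Microsoft.Support/*"],
        "not_actions": [],
    },
}


def _matches(action: str, patterns: list) -> bool:
    """Case-insensitive wildcard match of an action against patterns."""
    lowered = action.lower()
    return any(fnmatchcase(lowered, p.lower()) for p in patterns)


def allows(role: str, action: str) -> bool:
    """True if the role grants the action and does not exclude it."""
    definition = ROLES[role]
    return (_matches(action, definition["actions"])
            and not _matches(action, definition["not_actions"]))


if __name__ == "__main__":
    for role in ROLES:
        print(role, allows(role, "Microsoft.Authorization/roleAssignments/write"))
```

Under these abbreviated definitions, Contributor can create VMs and networks but cannot write role assignments, while Owner and User Access Administrator can. That suggests someone with Owner or User Access Administrator on the resource group has to perform the role assignment step.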

@gramhagen - thanks a lot! Creating a fresh resource group along with the role assignment access worked, based on your last suggestion.
The script works, but I get this error…

any thoughts ?

Hmm, not sure about that one. If it got that far the head node should exist; you can try running the same ssh command yourself to make sure you are able to connect to the head node. Sometimes there are network rules in place limiting SSH access to resources, depending on how your subscription is set up. You should be able to run ray up again without it trying to re-create resources; it can take a while to connect.
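One way to check for network rules blocking SSH is to test whether the head node's port 22 is reachable from the machine that runs ray up. The following is a minimal sketch using only the Python standard library; the function name and the example IP address are hypothetical, and a successful connection only shows the port is open, not that SSH authentication will succeed.

```python
import socket


def can_reach(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address; replace with the head node's public IP.
    head_node_ip = "203.0.113.10"
    print("SSH port reachable:", can_reach(head_node_ip))
```

Run it from the same machine and network as ray up, since firewall and network security group rules can differ by source address.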

@gramhagen - I can see that the head node is set up, and I tried to ssh into it separately in that window, and it worked. I am not sure why the ray up command is not able to ssh in.