Azure Get Cluster Status

I am trying to start a Ray cluster on Azure via the Azure Cloud Shell, as described in the example
https://techcommunity.microsoft.com/t5/ai-customer-engineering-team/deploying-ray-on-azure-for-scalable-distributed-reinforcement/ba-p/1329036
with the setup file example-full.yaml
However, I have commented out the conda setup commands, as ray up always seemed to fail on them (it could not find the environment py37_tensorflow).
With this, I can start a head node and connect to it via ray attach.
Then, when I run ray status, I get "No Cluster Status" together with a ModuleNotFoundError.


I tried to install azure-common manually (via pip), but it is already installed.
I get the same error if I try to call ray.init(address='auto') in a Python script.
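The script is essentially just this minimal connection test (a rough sketch, nothing Azure-specific in it):

import ray

# Connect to the cluster started by ray up; address='auto' assumes the
# script runs on the head node (e.g. after ray attach).
ray.init(address='auto')

# Print what the cluster reports, just to confirm the connection works.
print(ray.cluster_resources())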
Furthermore, I noticed that this setup only creates a head node but none of the worker nodes described in the tutorial.
How can I get ray status and ray.init to work? Is the problem with the worker nodes potentially linked to this issue?

Thank you

Hi, there have been a few updates since that post was made. The latest example_full.yaml actually deploys a docker image to the nodes and doesn’t use the base Linux environment of the DSVM image (so those conda envs don’t exist). I’ve created a PR here which updates a few things in the yamls and the Azure node provider to work with changes to the Azure SDK functions.

Please let me know if the example_full.yaml used there works for you.

As for the worker nodes, they are deployed from the head node, so if something was failing there, that could explain why there are no worker nodes. Also, the default minimum number of workers is 0, so workers will only be deployed once there are tasks or actors that require the autoscaler to scale the cluster up.
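For example, a rough sketch like the following (run from the head node; the task count of 32 is arbitrary, just something larger than the head node's CPU count) creates enough resource demand for the autoscaler to start adding workers:

import time
import ray

ray.init(address='auto')

# Each task requests a full CPU. Asking for more CPUs than the head node
# provides should make the autoscaler launch workers, up to max_workers.
@ray.remote(num_cpus=1)
def busy():
    time.sleep(60)
    return "done"

futures = [busy.remote() for _ in range(32)]
print(len(ray.get(futures)), "tasks finished")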

You can force a certain number of workers using the min_workers property in the yaml file:

max_workers: 2
min_workers: 1

Hi, thank you for your answer.

I have kept a close eye on the GitHub PR you mentioned and, now that it has been merged, installed the latest (nightly) wheel in the cloud shell and also on the head node. However, my problem persists. Curiously, I still get this error even though I can verify that the folder has been renamed from azure to _azure.

I have tried uninstalling and reinstalling ray several times, also in a separate conda environment, and tried building it from the git repository; however, the error I receive does not change.
I have also tried to force the creation of a worker with min_workers: 1, but no worker was created.
What else could I try?

Thank you

Update:
ray status is working now.
Using the newest version of the example yaml file, I changed the imageVersion to latest and set the docker image to image: "rayproject/ray-ml:nightly-py37-gpu".
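In case it is useful to anyone else, a small script along these lines lists the nodes the cluster currently knows about (so worker nodes should show up here once they are provisioned):

import ray

ray.init(address='auto')

# ray.nodes() returns one dict per node known to the cluster.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "alive" if node["Alive"] else "dead")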
Thank you