What makes a cluster "unique" in the resource group?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello!

On Azure, I have a resource group in which I want to be able to start multiple clusters with the ray up command (everything below uses Ray 1.12.1).
These clusters are supposed to be independent from each other, in the sense that they do not share nodes.
The way I am distinguishing clusters is by name:
cluster_name: clustername, i.e. I give each cluster a unique name.
However, occasionally this goes wrong. Unfortunately I do not know how to reproduce it reliably; it seems to happen about half the time. When one cluster is already running under one name and I start another cluster with a different name, the already running cluster is updated instead of a new one being started. For example, I had “clusterA” running and ran ray up cluster_B.yaml, where the name is clusterB, but it just updated all the configs of clusterA and completely ignored that the cluster I requested has a different name.

Here is an example of the log output:

(ramblr_ray_job_submission)➜  cloud_training git:(main) ✗ ray up ray_cluster_rayservetest.yaml --yes
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
**Cluster: rayservetest**

2022-07-04 15:05:28,506 INFO util.py:335 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y [automatic, due to --yes]

Currently running head node is out-of-date with cluster configuration
  Current hash is e93ff2695b63b7e98669cd6a094cadd59c685c2b, expected a2f6048fae14d04119799c5a46ea656e6208d1f5
Acquiring an up-to-date head node
  Relaunching the head node. Confirm [y/N]: y [automatic, due to --yes]
**Terminated head node ray-raytraintest-head-31e089440**

As you can see, I try to start the cluster rayservetest, but it terminates raytraintest first and then launches the other one.

I have also encountered a similar error:
I ran ray down on a cluster with a given name, but it shut down a different cluster instead.

So here are my questions:

  • Is the name not supposed to make a cluster unique?
  • If not, what makes a cluster unique?
  • Should I be able to run different clusters in the same resource group?
  • How does this error happen?

This seems related:
[Bug] [Autoscaler] [Azure] Can't create multiple clusters on Azure · Issue #22996 · ray-project/ray
[autoscaler] Enable creating multiple clusters in one resource group … by HongW2019 · Pull Request #22997 · ray-project/ray

Thank you!

This does sound like the bug and PR you pointed to (which unfortunately slipped through the cracks).

What you’re doing should work (cluster name should be the only thing that has to be unique).

I’ll try to ping the maintainers of the Azure node provider, but as a mitigation, do you think you could launch the clusters in different resource groups for now?
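
A minimal sketch of that workaround, assuming the usual provider section of an Azure cluster YAML. The file names, location, and resource group names are placeholders, not values from this thread. The two YAML documents below stand for two separate config files, each passed to its own ray up call:

```yaml
# ray_cluster_a.yaml (hypothetical file name)
cluster_name: clusterA
provider:
    type: azure
    location: westus2                # placeholder region
    resource_group: ray-cluster-a    # placeholder; used only by clusterA
---
# ray_cluster_b.yaml (hypothetical file name)
cluster_name: clusterB
provider:
    type: azure
    location: westus2                # placeholder region
    resource_group: ray-cluster-b    # placeholder; used only by clusterB
```

With each cluster in its own resource group, the two clusters should not see each other's nodes, so the name collision described above should not come into play.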

Hi Alex,

thanks for the quick reply!
I would appreciate it if that got fixed soon. In the meantime, I can try using different resource groups or look into Kubernetes instead.

Thank you!

Let’s carry on the discussion on GitHub – it looks like we’re close to merging a solution there.
Will mark this topic as resolved since we’ve identified the underlying issue.