How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello!
On Azure, I have a resource group in which I want to be able to start multiple clusters with the `ray up` command (everything below uses Ray 1.12.1).
These clusters are supposed to be independent from each other, in the sense that they do not share nodes.
The way I distinguish clusters is by name, `cluster_name: clustername`, i.e. I give each cluster a unique name.
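To rule out an accidental name collision between my config files, a check like the one below could be run before `ray up`. This is only a minimal sketch: it assumes `cluster_name` is a top-level key on its own line, the helper names `cluster_name` and `find_duplicate_names` are hypothetical, and the regex is a stand-in for a proper YAML parser.

```python
import re
from collections import defaultdict

# Assumption: `cluster_name` is a top-level key on its own line.
# This regex is a stand-in for a real YAML parser.
_NAME_RE = re.compile(r"^cluster_name:[ \t]*['\"]?([^'\"#\s]+)", re.MULTILINE)


def cluster_name(yaml_text):
    """Return the cluster_name found in a config's text, or None if absent."""
    match = _NAME_RE.search(yaml_text)
    return match.group(1) if match else None


def find_duplicate_names(configs):
    """Map each cluster name used by more than one config to its files.

    `configs` maps a file name to the text of that cluster config.
    """
    by_name = defaultdict(list)
    for filename, text in configs.items():
        by_name[cluster_name(text)].append(filename)
    return {
        name: sorted(files)
        for name, files in by_name.items()
        if name is not None and len(files) > 1
    }
```

An empty result only shows that the names in the files differ; it says nothing about how the Azure provider tells clusters apart.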
However, occasionally (unfortunately I do not know how to reproduce it reliably; it seems to be a 50/50 kind of thing), when one cluster is already running and I start a second cluster with a different name, the running cluster is updated instead of a new one being started. For example, with “clusterA” running, I ran `ray up cluster_B.yaml`, where the name is `clusterB`, but it just updated all the configs of `clusterA` and completely ignored that the cluster I requested has a different name.
Here is example output from such a run:
(ramblr_ray_job_submission) ➜ cloud_training git:(main) ✗ ray up ray_cluster_rayservetest.yaml --yes
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
**Cluster: rayservetest**
2022-07-04 15:05:28,506 INFO util.py:335 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y [automatic, due to --yes]
Currently running head node is out-of-date with cluster configuration
Current hash is e93ff2695b63b7e98669cd6a094cadd59c685c2b, expected a2f6048fae14d04119799c5a46ea656e6208d1f5
Acquiring an up-to-date head node
Relaunching the head node. Confirm [y/N]: y [automatic, due to --yes]
**Terminated head node ray-raytraintest-head-31e089440**
As you can see, I try to start the cluster `rayservetest`, but it terminates the head node of `raytraintest` first and then launches the other one.
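The mismatch can also be spotted mechanically in saved `ray up` output. The sketch below is an illustration only: the function name `find_cluster_mismatch` is hypothetical, and the node-name pattern `ray-<cluster_name>-head-<hex id>` is inferred from the single log line above rather than from any documented naming scheme. It can only report the problem after the fact, since the node is already terminated by the time the line is printed.

```python
import re


def find_cluster_mismatch(log_text):
    """Return (requested, terminated) if `ray up` terminated a head node
    belonging to a different cluster than the one requested, else None.

    Assumption: head nodes are named `ray-<cluster_name>-head-<hex id>`,
    as inferred from the log output in this post.
    """
    requested = re.search(r"Cluster:\s*([\w-]+)", log_text)
    terminated = re.search(
        r"Terminated head node ray-(.+)-head-[0-9a-f]+", log_text
    )
    if not requested or not terminated:
        return None
    if requested.group(1) == terminated.group(1):
        return None
    return requested.group(1), terminated.group(1)
```

Applied to the output above, this returns the pair `("rayservetest", "raytraintest")`.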
I have encountered a similar error: I ran `ray down` on a cluster with a given name, but it shut down a different cluster instead.
So here are my questions:
- Is the name not supposed to make a cluster unique?
- If not, what makes a cluster unique?
- Should I be able to run different clusters in the same resource group?
- How does this error happen?
These seem related:
- [Bug] [Autoscaler] [Azure] Can't create multiple clusters on Azure (Issue #22996, ray-project/ray)
- [autoscaler] Enable creating multiple clusters in one resource group … by HongW2019 (Pull Request #22997, ray-project/ray)
Thank you!