Cluster configuration on Azure running docker containers

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Background

I am investigating the use of Ray for both a production and a separate R&D parallel compute solution on the Azure cloud.

There are a number of issues that have been challenging me, a particular limitation at the moment on Azure being that only a single cluster can be stood up in a subscription. I believe a fix is potentially on the way to allow running multiple clusters in one resource group (see [autoscaler] Enable creating multiple clusters in one resource group … by HongW2019 · Pull Request #22997 · ray-project/ray · GitHub).

To investigate the flexibility of the Ray system, I have been looking into how dependencies are handled when executing remote tasks, and decided that Docker containers are the most controllable and desirable approach, especially as we have private package dependencies. I have been building Docker containers for executing code on the remote Azure-hosted cluster using the Ray API and the configuration YAML files.

Desired outcome

To be able to execute different problems with different dependencies (accounted for by different Docker containers) on a single cluster, such that the task dependencies are supplied at runtime together with the task execution.

This would be through either the provision of specific node_types associated with particular docker containers, or specifying the requirements through the runtime_env argument of the ray.init() method.
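
As a rough sketch of the intent from the driver side (the task name fit_model is a placeholder, and I am assuming here that the experimental container field of runtime_env can be attached per task via .options, which may well not be supported):

import ray

ray.init(address="ray://localhost:10001")

@ray.remote
def fit_model(config):
    # Expected to execute inside whichever environment is attached to this call.
    ...

# Attach the dependencies at submission time (image name illustrative).
ref = fit_model.options(
    runtime_env={
        "container": {
            "image": "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4"
        }
    }
).remote({})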

Attempted solutions so far

To test these scenarios as potential solutions to the desired outcome, the cluster was set up on Azure using the configuration YAML shown below (see Ray cluster YAML config for Azure, provided in full for reference) and the ray up CLI command.

Looking at the YAML config, some key features of the deployment are as follows:

  • Docker container deployment - the head node and worker nodes are deployed as Docker containers with a basic install of the Ray runtime, plus the dependencies required for working with the runtime_env Docker deployment (e.g. podman, and the Azure CLI for authenticating to a private repository).

  • Custom node types - some defined with different instance sizes, and some with specific docker specifications to supersede the default.

Currently there are two ways that I have tried to achieve this, and neither is working:

Approach 1 - Nodetype specification:

By specifying a node_type configuration and using the per-node-type docker section to associate a particular Docker container with a node type via custom resource keys, a specific task should run on a Docker container that already has the dependencies installed.

An example of such a definition is shown below, where the node type ray.worker.default_bayes has a custom resource key NODETYPE_DEFAULT_BAYES, and the docker fields in the node definition specify a different worker image from the cluster base:

ray.worker.default_bayes:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 4
        # The resources provided by this node type.
        resources: {"CPU": 16, "NODETYPE_DEFAULT_BAYES": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D16s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
        docker:
            worker_image: "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.6"
            pull_before_run: False
            worker_run_options:
                - --ulimit nofile=65536:65536
                - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].
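
With this node type defined, my expectation is that a task requesting the custom resource key causes the autoscaler to launch a ray.worker.default_bayes worker (running the bespoke image) and is scheduled onto it, along the lines of (decorator arguments illustrative):

import ray

@ray.remote(resources={"NODETYPE_DEFAULT_BAYES": 1})
def ray_model_sampling(**model_config_dict):
    # Should only ever be scheduled on ray.worker.default_bayes workers, where
    # the private dependencies are already baked into the worker image.
    ...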

Approach 2 - runtime_env docker container:

By supplying the container image at execution time via the runtime_env argument of the ray.init() method when the Ray client connects to the cluster, the correct execution environment with the required preinstalled dependencies should be provided.

This was achieved by supplying the runtime_env argument with a dictionary containing the following definitions:

runtime_env = {
    "eager_install": True,
    "container": {
        "image": "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4",
        # "run_options": ["--cap-drop SYS_ADMIN", "--log-level=debug", "--privileged"]
    },
}
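
The dictionary is then passed straight into the client connection (a minimal sketch, using the runtime_env dict above and the same address as elsewhere in this post):

import ray

# Connect through the Ray client; the runtime_env above asks the cluster to
# start this job's client server and workers inside the named container image.
ctx = ray.init(address="ray://localhost:10001", runtime_env=runtime_env)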

Both of these approaches failed, for different reasons.

Questions

  1. Does this setup sound reasonable, or have I misunderstood how to set up Ray for my desired outcome?

  2. Is the runtime_env use of Docker containers compatible with a Ray cluster that is itself already running on Docker containers?

  3. Why are the runtime dependencies required on the head node when trying the node_type approach?
    Is this a fundamental limitation of using the Ray client API, instead of using ray submit with an entry script?

Approach failure details

Approach 1 - Nodetype specification

The node_type approach failed due to an unpickling error at execution time, caused by a dependency issue relating to a local private package. The package is installed in the bespoke Docker container that the tasks are supposed to execute in, but it would appear that the call fails on the head node because of this requirement.
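
If the failure really is just the client server on the head node trying to unpickle a payload that references the private package, one possible (untested) mitigation would be to keep that package out of the pickled task payload altogether, e.g. by importing it inside the task body and passing only plain-Python arguments:

import ray

@ray.remote(resources={"NODETYPE_DEFAULT_BAYES": 1})
def ray_model_sampling(**model_config_dict):
    # Import deferred so that neither the driver nor the head node needs
    # SmartReturnToolsBayes in order to (de)serialise this task; only the
    # bespoke worker image has to provide it.
    from SmartReturnToolsBayes import model
    ...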

The traceback from the python driver script was as follows:

Put failed:
Traceback (most recent call last):
  File "pkg_test_corbana_bayesian_multiple_regression__ray_cluster_v2dev_001.py", line 156, in <module>
    bayes_mlr.fit()
  File "/datadrive/drive0/projects/argrilytics_4100_bayesian_multiple_regression/SmartReturnTools_bayesian/exploration/bayesian_regression/pkg/SmartReturnToolsBayes/model.py", line 254, in fit
    model_id = ray_model_sampling.remote(**model_config_dict)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/remote_function.py", line 227, in remote
    return func_cls._remote(args=args, kwargs=kwargs, **options)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 303, in _invocation_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/remote_function.py", line 270, in _remote
    return client_mode_convert_function(
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 177, in client_mode_convert_function
    return client_func._remote(in_args, in_kwargs, **kwargs)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/common.py", line 303, in _remote
    return self.options(**option_args).remote(*args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/common.py", line 577, in remote
    return return_refs(ray.call_remote(self, *args, **kwargs))
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/api.py", line 109, in call_remote
    return self.worker.call_remote(instance, *args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/worker.py", line 544, in call_remote
    task = instance._prepare_client_task()
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/common.py", line 583, in _prepare_client_task
    task = self._remote_stub._prepare_client_task()
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/common.py", line 329, in _prepare_client_task
    self._ensure_ref()
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/common.py", line 324, in _ensure_ref
    self._ref = ray.worker._put_pickled(
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/worker.py", line 498, in _put_pickled
    raise cloudpickle.loads(resp.error)
ModuleNotFoundError: No module named 'SmartReturnToolsBayes'

The error logs for the job on the server in ray_client_server_23002.err were:

2022-03-31 06:29:32,560 INFO server.py:843 -- Starting Ray Client server on 0.0.0.0:23002
2022-03-31 06:29:32,952 INFO logservicer.py:102 -- New logs connection established. Total clients: 1
2022-03-31 06:29:32,955 INFO worker.py:946 -- Connecting to existing Ray cluster at address: 10.41.0.4:6379
2022-03-31 06:29:40,223 ERROR server.py:503 -- Put failed:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/server.py", line 499, in _put_object
    obj = loads_from_client(data, self)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/server_pickler.py", line 130, in loads_from_client
    return ClientUnpickler(
ModuleNotFoundError: No module named 'SmartReturnToolsBayes'
2022-03-31 06:29:44,593 INFO server.py:890 -- 25 idle checks before shutdown.
2022-03-31 06:29:49,603 INFO server.py:890 -- 20 idle checks before shutdown.
2022-03-31 06:29:54,613 INFO server.py:890 -- 15 idle checks before shutdown.
2022-03-31 06:29:59,623 INFO server.py:890 -- 10 idle checks before shutdown.
2022-03-31 06:30:04,632 INFO server.py:890 -- 5 idle checks before shutdown.

Approach 2 - runtime_env docker container specification:

In this approach, podman successfully pulls the container on the head node and starts it running, but the client server times out and shuts down without scheduling or executing any tasks.
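
As a first diagnostic (just a sanity check, not a fix), it may be worth confirming that a plain client connection to the same address, without the container runtime_env, still works; if it does, the problem is isolated to the specific server starting up inside the podman container rather than the cluster itself:

import ray

# Plain connection with no container runtime_env.
ctx = ray.init(address="ray://localhost:10001")
print(ray.cluster_resources())
ray.shutdown()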

The traceback returned by the driver script is as follows:

(py38_pymc3_ray_v2.0.0dev_bsrt) azureuser@ecm-smart-return-dev-001-vm:~/bayesian_regression/notebooks/package_based$ python pkg_test_corbana_bayesian_multiple_regression__ray_cluster_v2dev_runtime_docker_001.py
[INFO] Job is executing using a Ray Compute cluster.
[INFO] Models to be run on ray cluster at address:: ray://localhost:10001
[INFO] - ray package has been found and beginning cluster connection process.......there is a problem.
[ERROR] - Failed to connect to the Ray Compute Cluster.
Traceback (most recent call last):
  File "pkg_test_corbana_bayesian_multiple_regression__ray_cluster_v2dev_runtime_docker_001.py", line 129, in <module>
    bayes_mlr = BayesianMultipleRegression(
  File "/datadrive/drive0/projects/argrilytics_4100_bayesian_multiple_regression/SmartReturnTools_bayesian/exploration/bayesian_regression/pkg/SmartReturnToolsBayes/model.py", line 40, in __init__
    self._setup_ray_job(ray_config=ray_config)
  File "/datadrive/drive0/projects/argrilytics_4100_bayesian_multiple_regression/SmartReturnTools_bayesian/exploration/bayesian_regression/pkg/SmartReturnToolsBayes/model.py", line 347, in _setup_ray_job
    self._ray_context = ray.init(address=self.ray_cluster_url_str, runtime_env=self.ray_runtime_env)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/worker.py", line 882, in init
    return builder.connect()
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/client_builder.py", line 160, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client_connect.py", line 36, in connect
    conn = ray.connect(
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/__init__.py", line 243, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/__init__.py", line 94, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/home/azureuser/.conda/envs/py38_pymc3_ray_v2.0.0dev_bsrt/lib/python3.8/site-packages/ray/util/client/worker.py", line 803, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 664, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23001.err for detailed logs.

The ray_client_server_23001.err file reports the following:

time="2022-03-31T06:21:03-07:00" level=info msg="podman filtering at log level debug"
time="2022-03-31T06:21:03-07:00" level=debug msg="Called run.PersistentPreRunE(podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=546 --cap-drop SYS_ADMIN --log-level=debug --privileged --entrypoint python our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4 -m ray.util.client.server --address=10.41.0.4:6379 --host=0.0.0.0 --port=23001 --mode=specific-server --redis-password=5241590000000000)"
time="2022-03-31T06:21:03-07:00" level=debug msg="Merged system config \"/usr/share/containers/containers.conf\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Using conmon: \"/usr/libexec/podman/conmon\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Initializing boltdb state at /home/ray/.local/share/containers/storage/libpod/bolt_state.db"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using graph driver overlay"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using graph root /home/ray/.local/share/containers/storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using run root /tmp/podman-run-1000/containers"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using static dir /home/ray/.local/share/containers/storage/libpod"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using tmp dir /tmp/podman-run-1000/libpod/tmp"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using volume path /home/ray/.local/share/containers/storage/volumes"
time="2022-03-31T06:21:03-07:00" level=debug msg="Set libpod namespace to \"\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Not configuring container store"
time="2022-03-31T06:21:03-07:00" level=debug msg="Initializing event backend file"
time="2022-03-31T06:21:03-07:00" level=debug msg="configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument"
time="2022-03-31T06:21:03-07:00" level=debug msg="configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using OCI runtime \"/usr/bin/crun\""
time="2022-03-31T06:21:03-07:00" level=info msg="Found CNI network podman (type=bridge) at /home/ray/.config/cni/net.d/87-podman.conflist"
time="2022-03-31T06:21:03-07:00" level=debug msg="Default CNI network name podman is unchangeable"
time="2022-03-31T06:21:03-07:00" level=info msg="Setting parallel job count to 7"
time="2022-03-31T06:21:03-07:00" level=info msg="podman filtering at log level debug"
time="2022-03-31T06:21:03-07:00" level=debug msg="Called run.PersistentPreRunE(podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=546 --cap-drop SYS_ADMIN --log-level=debug --privileged --entrypoint python our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4 -m ray.util.client.server --address=10.41.0.4:6379 --host=0.0.0.0 --port=23001 --mode=specific-server --redis-password=5241590000000000)"
time="2022-03-31T06:21:03-07:00" level=debug msg="overlay storage already configured with a mount-program"
time="2022-03-31T06:21:03-07:00" level=debug msg="Merged system config \"/usr/share/containers/containers.conf\""
time="2022-03-31T06:21:03-07:00" level=debug msg="overlay storage already configured with a mount-program"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using conmon: \"/usr/libexec/podman/conmon\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Initializing boltdb state at /home/ray/.local/share/containers/storage/libpod/bolt_state.db"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using graph driver overlay"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using graph root /home/ray/.local/share/containers/storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using run root /tmp/podman-run-1000/containers"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using static dir /home/ray/.local/share/containers/storage/libpod"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using tmp dir /tmp/podman-run-1000/libpod/tmp"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using volume path /home/ray/.local/share/containers/storage/volumes"
time="2022-03-31T06:21:03-07:00" level=debug msg="overlay storage already configured with a mount-program"
time="2022-03-31T06:21:03-07:00" level=debug msg="Set libpod namespace to \"\""
time="2022-03-31T06:21:03-07:00" level=debug msg="[graphdriver] trying provided driver \"overlay\""
time="2022-03-31T06:21:03-07:00" level=debug msg="overlay: mount_program=/usr/bin/fuse-overlayfs"
time="2022-03-31T06:21:03-07:00" level=debug msg="backingFs=overlayfs, projectQuotaSupported=false, useNativeDiff=false, usingMetacopy=false"
time="2022-03-31T06:21:03-07:00" level=debug msg="Initializing event backend file"
time="2022-03-31T06:21:03-07:00" level=debug msg="configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument"
time="2022-03-31T06:21:03-07:00" level=debug msg="configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument"
time="2022-03-31T06:21:03-07:00" level=debug msg="Using OCI runtime \"/usr/bin/crun\""
time="2022-03-31T06:21:03-07:00" level=info msg="Found CNI network podman (type=bridge) at /home/ray/.config/cni/net.d/87-podman.conflist"
time="2022-03-31T06:21:03-07:00" level=debug msg="Default CNI network name podman is unchangeable"
time="2022-03-31T06:21:03-07:00" level=info msg="Setting parallel job count to 7"
time="2022-03-31T06:21:03-07:00" level=info msg="Failed to detect the owner for the current cgroup: stat /sys/fs/cgroup/systemd/docker/2d529a871a976441ea712450e72fa874955dab7c517f366fa7bbdf3574ad5f91: no such file or directory"
time="2022-03-31T06:21:03-07:00" level=debug msg="Pulling image our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4 (policy: missing)"
time="2022-03-31T06:21:03-07:00" level=debug msg="Looking up image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Trying \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" ..."
time="2022-03-31T06:21:03-07:00" level=debug msg="parsed reference into \"[overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Found image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" as \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Found image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" as \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage ([overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee)"
time="2022-03-31T06:21:03-07:00" level=debug msg="Looking up image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Trying \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" ..."
time="2022-03-31T06:21:03-07:00" level=debug msg="parsed reference into \"[overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Found image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" as \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Found image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" as \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage ([overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee)"
time="2022-03-31T06:21:03-07:00" level=debug msg="User mount /tmp/ray:/tmp/ray options []"
time="2022-03-31T06:21:03-07:00" level=debug msg="Looking up image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Trying \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" ..."
time="2022-03-31T06:21:03-07:00" level=debug msg="parsed reference into \"[overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Found image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" as \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Found image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" as \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage ([overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee)"
time="2022-03-31T06:21:03-07:00" level=debug msg="Inspecting image 3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee"
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Looking up image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Trying \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" ..."
time="2022-03-31T06:21:03-07:00" level=debug msg="parsed reference into \"[overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Found image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" as \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage"
time="2022-03-31T06:21:03-07:00" level=debug msg="Found image \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" as \"our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4\" in local containers storage ([overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee)"
time="2022-03-31T06:21:03-07:00" level=debug msg="Inspecting image 3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee"
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="Inspecting image 3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee"
time="2022-03-31T06:21:03-07:00" level=debug msg="using systemd mode: false"
time="2022-03-31T06:21:03-07:00" level=debug msg="Loading seccomp profile from \"/usr/share/containers/seccomp.json\""
time="2022-03-31T06:21:03-07:00" level=info msg="Sysctl net.ipv4.ping_group_range=0 0 ignored in containers.conf, since Network Namespace set to host"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/autofs"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/full"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/fuse"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/input/js0"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/kmsg"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/net/tun"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/null"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/random"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/rfkill"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/urandom"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/vfio/vfio"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/zero"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/pts"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /sys"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /dev/mqueue"
time="2022-03-31T06:21:03-07:00" level=debug msg="Adding mount /proc"
time="2022-03-31T06:21:03-07:00" level=debug msg="Allocated lock 1 for container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d"
time="2022-03-31T06:21:03-07:00" level=debug msg="parsed reference into \"[overlay@/home/ray/.local/share/containers/storage+/tmp/podman-run-1000/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:03-07:00" level=debug msg="exporting opaque data as blob \"sha256:3d240be04bbb9e2b8dbe671a7d1ee8a62b2bde0e75235eb1e2766ee9d431cfee\""
time="2022-03-31T06:21:04-07:00" level=debug msg="created container \"17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d\""
time="2022-03-31T06:21:04-07:00" level=debug msg="container \"17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d\" has work directory \"/home/ray/.local/share/containers/storage/overlay-containers/17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d/userdata\""
time="2022-03-31T06:21:04-07:00" level=debug msg="container \"17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d\" has run directory \"/tmp/podman-run-1000/containers/overlay-containers/17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d/userdata\""
time="2022-03-31T06:21:04-07:00" level=debug msg="Not attaching to stdin"
time="2022-03-31T06:21:04-07:00" level=debug msg="[graphdriver] trying provided driver \"overlay\""
time="2022-03-31T06:21:04-07:00" level=debug msg="overlay: mount_program=/usr/bin/fuse-overlayfs"
time="2022-03-31T06:21:04-07:00" level=debug msg="backingFs=overlayfs, projectQuotaSupported=false, useNativeDiff=false, usingMetacopy=false"
time="2022-03-31T06:21:04-07:00" level=debug msg="overlay: mount_data=,lowerdir=/home/ray/.local/share/containers/storage/overlay/l/WKM4MASOE35KHXBN7OTNV45HXN:/home/ray/.local/share/containers/storage/overlay/l/ENAET6Y7U7ARXJ7YCLT4QSFLZ5:/home/ray/.local/share/containers/storage/overlay/l/URDPUX2GC77GQWIZ5HUWXSVG5Y:/home/ray/.local/share/containers/storage/overlay/l/SC5VQVBHNHUBIZAAR4SVJZVYCX:/home/ray/.local/share/containers/storage/overlay/l/RPAB7CWG5BR6IU6N7LXNLBNI72:/home/ray/.local/share/containers/storage/overlay/l/4A72RPJR5L7DCHFJMW6TTNFENU:/home/ray/.local/share/containers/storage/overlay/l/VLT4DLIMEOSHXVQIXERDBFXREL:/home/ray/.local/share/containers/storage/overlay/l/3IZUDADYOXFM5DSLT2AF4SR5CV:/home/ray/.local/share/containers/storage/overlay/l/WGETRGAG6MVSVUL6LAFMDSPCSA:/home/ray/.local/share/containers/storage/overlay/l/2FZNNXB7PPEEVYU36ZG4G6ACSF:/home/ray/.local/share/containers/storage/overlay/l/CSHLEY75FRLAP4D4VLMLYHPR47:/home/ray/.local/share/containers/storage/overlay/l/VQ6UJEULKIXDSQ5OKJF2RLKDQT,upperdir=/home/ray/.local/share/containers/storage/overlay/ea54d4a4c01378b6dfe9fe9171378da248a39b0fc1005a95aa07f94b9a028d1e/diff,workdir=/home/ray/.local/share/containers/storage/overlay/ea54d4a4c01378b6dfe9fe9171378da248a39b0fc1005a95aa07f94b9a028d1e/work"
time="2022-03-31T06:21:04-07:00" level=debug msg="mounted container \"17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d\" at \"/home/ray/.local/share/containers/storage/overlay/ea54d4a4c01378b6dfe9fe9171378da248a39b0fc1005a95aa07f94b9a028d1e/merged\""
time="2022-03-31T06:21:04-07:00" level=debug msg="Created root filesystem for container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d at /home/ray/.local/share/containers/storage/overlay/ea54d4a4c01378b6dfe9fe9171378da248a39b0fc1005a95aa07f94b9a028d1e/merged"
time="2022-03-31T06:21:04-07:00" level=debug msg="network configuration does not support host.containers.internal address"
time="2022-03-31T06:21:04-07:00" level=debug msg="Not modifying container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d /etc/passwd"
time="2022-03-31T06:21:04-07:00" level=debug msg="Modifying container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d /etc/group"
time="2022-03-31T06:21:04-07:00" level=debug msg="/etc/system-fips does not exist on host, not mounting FIPS mode subscription"
time="2022-03-31T06:21:04-07:00" level=debug msg="set root propagation to \"rslave\""
time="2022-03-31T06:21:04-07:00" level=debug msg="reading hooks from /usr/share/containers/oci/hooks.d"
time="2022-03-31T06:21:04-07:00" level=debug msg="Workdir \"/home/ray\" resolved to host path \"/home/ray/.local/share/containers/storage/overlay/ea54d4a4c01378b6dfe9fe9171378da248a39b0fc1005a95aa07f94b9a028d1e/merged/home/ray\""
time="2022-03-31T06:21:04-07:00" level=debug msg="Created OCI spec for container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d at /home/ray/.local/share/containers/storage/overlay-containers/17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d/userdata/config.json"
time="2022-03-31T06:21:04-07:00" level=debug msg="/usr/libexec/podman/conmon messages will be logged to syslog"
time="2022-03-31T06:21:04-07:00" level=debug msg="running conmon: /usr/libexec/podman/conmon" args="[--api-version 1 -c 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d -u 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d -r /usr/bin/crun -b /home/ray/.local/share/containers/storage/overlay-containers/17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d/userdata -p /tmp/podman-run-1000/containers/overlay-containers/17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d/userdata/pidfile -n brave_neumann --exit-dir /tmp/podman-run-1000/libpod/tmp/exits --full-attach -l k8s-file:/home/ray/.local/share/containers/storage/overlay-containers/17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d/userdata/ctr.log --log-level debug --syslog --conmon-pidfile /tmp/podman-run-1000/containers/overlay-containers/17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/ray/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /tmp/podman-run-1000/containers --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /tmp/podman-run-1000/libpod/tmp --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mount_program=/usr/bin/fuse-overlayfs --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d]"
time="2022-03-31T06:21:04-07:00" level=info msg="Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup for cpu: mkdir /sys/fs/cgroup/cpu/conmon: permission denied"
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied

time="2022-03-31T06:21:04-07:00" level=debug msg="Received: 1339"
time="2022-03-31T06:21:04-07:00" level=info msg="Got Conmon PID as 1336"
time="2022-03-31T06:21:04-07:00" level=debug msg="Created container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d in OCI runtime"
time="2022-03-31T06:21:04-07:00" level=debug msg="Attaching to container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d"
time="2022-03-31T06:21:04-07:00" level=debug msg="Starting container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d with command [python -m ray.util.client.server --address=10.41.0.4:6379 --host=0.0.0.0 --port=23001 --mode=specific-server --redis-password=5241590000000000]"
time="2022-03-31T06:21:04-07:00" level=debug msg="Started container 17b51a8292afb00f0ac03a17c949dfc8ee6f7ed0c38325240758a3c1ea44598d"
time="2022-03-31T06:21:04-07:00" level=debug msg="Enabling signal proxying"
2022-03-31 06:21:06,618 INFO server.py:843 -- Starting Ray Client server on 0.0.0.0:23001
2022-03-31 06:21:11,670 INFO server.py:890 -- 25 idle checks before shutdown.
2022-03-31 06:21:16,680 INFO server.py:890 -- 20 idle checks before shutdown.
2022-03-31 06:21:21,690 INFO server.py:890 -- 15 idle checks before shutdown.
2022-03-31 06:21:26,701 INFO server.py:890 -- 10 idle checks before shutdown.
2022-03-31 06:21:31,711 INFO server.py:890 -- 5 idle checks before shutdown.
time="2022-03-31T06:21:36-07:00" level=debug msg="Called run.PersistentPostRunE(podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=546 --cap-drop SYS_ADMIN --log-level=debug --privileged --entrypoint python our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4 -m ray.util.client.server --address=10.41.0.4:6379 --host=0.0.0.0 --port=23001 --mode=specific-server --redis-password=5241590000000000)"

Ray cluster YAML config for Azure

The current version of the YAML cluster configuration I am using is below:

# An unique identifier for the head node and workers of this cluster.
cluster_name: ecm-ray-v2-dev

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 4

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "our_private_ACR.azurecr.io/ray-ml-2.0.0dev-cpu:0.1.1" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: false
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
        - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-cpu"
    # Allow Ray to automatically detect GPUs

    worker_image: "our_private_ACR.azurecr.io/ray-ml-2.0.0dev-cpu:0.1.1"
    worker_run_options: # []
        - --ulimit nofile=65536:65536
        - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: westeurope
    resource_group: data-science-ray-clusters-rg
    # set subscription id otherwise the default from az cli will be used
    subscription_id: 3da03240-09fb-4d1e-8631-c507fcd3dd9b
    tags: {'owner' : 'my_email'}
    # When stopping worker nodes, don't delete them but just deallocate them.
    cache_stopped_nodes: true

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: azureuser
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/ray_cluster_bayesian_id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/ray_cluster_bayesian_id_rsa.pub

# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 0}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D16s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

    ray.worker.default_bayes:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 4
        # The resources provided by this node type.
        resources: {"CPU": 16, "NODETYPE_DEFAULT_BAYES": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D16s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
        docker:
            worker_image: "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.6"
            pull_before_run: False
            worker_run_options:
                - --ulimit nofile=65536:65536
                - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].

    ray.worker.default_itc:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 4
        # The resources provided by this node type.
        resources: {"CPU": 16, "NODETYPE_DEFAULT_ITC": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D16s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
        docker:
            worker_image: "our_private_ACR.azurecr.io/itctools-ray-2.0.0dev:0.1.0"
            pull_before_run: False
            worker_run_options:
                - --ulimit nofile=65536:65536
                - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].

    ray.worker.small_cpu:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 8, "small_cpu": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D8s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
    
    ray.worker.large_cpu:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 32, "large_cpu": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D32s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

    ray.worker.vlarge_cpu:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 64, "vlarge_cpu": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D64s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
     "~/.ssh/ray_cluster_bayesian_id_rsa.pub" : "~/.ssh/ray_cluster_bayesian_id_rsa.pub"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful
    # Bespoke commands for setting up access to the Azure Container Registry.
    # The node already has the managed identity associated with it, which has access privileges to the ACR, so login is simple
    - |
        az login --identity
        az acr login --name our_private_ACR
    
   
# List of shell commands to run to set up nodes.
# Custom commands that will be run on the all nodes after common setup [this is inside the docker container if using docker images].
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"


# Custom commands that will be run on the head node after common setup [this is inside the docker container if using docker images].
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands: # []
    # - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0
    # Get ACR access inside the docker container for PODMAN access.
    # First need to start docker service in the head node docker container.
    - sudo service docker start
    - sleep 20
    # Second then get credentials using the Managed Identity attached the headnode.
    - |
        az login --identity
        az acr login --name our_private_ACR

# Custom commands that will be run on worker nodes after common setup [this is inside the docker container if using docker images].
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
    

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

@gramhagen , can you please help answer this question?

I don’t think this setup is specific to Azure, other than the limitation you mention about multiple clusters in the same resource group.

Based on your desired outcome, I would think using separate clusters would be much more straightforward. I wasn't even aware you could switch environments in the same cluster.

I do know that because commands are sent from the head node it’s important to maintain the same environment (dependency versions) to ensure cloud pickling works as expected.
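
For example, something like this (rough sketch) can be used to compare the client environment against what the cluster actually runs:

import sys
import ray

@ray.remote
def versions():
    import sys, ray
    return sys.version, ray.__version__

ray.init(address="ray://localhost:10001")
print("client :", sys.version, ray.__version__)
print("cluster:", ray.get(versions.remote()))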

Beyond that I'm not sure how to debug your current issues; I would probably look at splitting up the clusters and using custom Docker images for dependencies.