Ray Cluster on Azure with runtime_env=docker help

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Background

I am investigating the use of Ray for both a production and a separate R&D parallel-compute resource solution, using the Azure cloud.

There are a number of issues that have been challenging me, a particular limitation at the moment on Azure being that only a single cluster can be stood up in a subscription. I believe there is potentially a fix on the way to allow multiple clusters to run in a resource group (see [autoscaler] Enable creating multiple clusters in one resource group … by HongW2019 · Pull Request #22997 · ray-project/ray · GitHub).

To investigate the flexibility of the Ray system, I have been looking into how dependencies are handled when executing remote tasks, and decided that Docker containers are the most controllable and desirable approach, especially as we have private package dependencies. I have been building Docker containers for the execution of code on the remote Azure-hosted cluster using the Ray API and the configuration YAML files.

Desired outcome

To be able to execute different problems, with different dependencies (accounted for with different Docker containers), on a single cluster, such that the task dependencies are supplied at runtime together with the task execution.

This would be achieved either by provisioning specific node_types associated with particular Docker containers, or by specifying the requirements through the runtime_env argument of the ray.init() method.

Attempted solutions so far

To test these scenarios as potential solutions to the desired outcome, the cluster was set up on Azure using the configuration YAML shown below (see the Ray cluster YAML config for Azure, provided in full at the end of this post for reference) and the ray up CLI command.

Looking at the YAML config, some key features of the deployment are as follows:

  • Docker container deployment - The head node and worker nodes are deployed as Docker containers which have a basic install of the Ray runtime, as well as some dependencies required for working with the runtime_env docker deployment (e.g. the podman dependency and the Azure CLI dependency for authenticating to a private registry).

  • Custom node types - some defined with different instance sizes, and some with specific docker specifications to supersede the default.

Currently there are two ways that I have tried to achieve this, and both are not working:

Approach 1 - Node type specification:

By specifying a node_type configuration and using the node-level docker section to associate a particular Docker container with a node type through custom resource keys, a specific task should run on a Docker container that already has the dependencies installed.

An example of such a definition is shown below, where the node type ray.worker.default_bayes has a custom resource key NODETYPE_DEFAULT_BAYES and, in the node definition, the docker fields are provided to use a different worker image than the cluster base:

ray.worker.default_bayes:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 4
        # The resources provided by this node type.
        resources: {"CPU": 16, "NODETYPE_DEFAULT_BAYES": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D16s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
        docker:
            worker_image: "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.6"
            pull_before_run: False
            worker_run_options:
                - --ulimit nofile=65536:65536
                - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].
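
For completeness, this is roughly how I intend to target that node type from the client side. It is a minimal sketch, assuming a Ray client connection to the head node (the address is a placeholder) and using the custom resource key to pin the task to the ray.worker.default_bayes workers:

import ray

# Connect over the Ray client; the address is a placeholder for the cluster's
# client server endpoint (port 10001 by default).
ray.init(address="ray://<head-node-ip>:10001")

# Request one unit of the custom resource so the task is only scheduled on
# ray.worker.default_bayes nodes, which run the pre-built bayes worker image.
@ray.remote(num_cpus=1, resources={"NODETYPE_DEFAULT_BAYES": 1})
def run_bayes_task(x):
    return x * 2

print(ray.get(run_bayes_task.remote(21)))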

Approach 2 - runtime_env docker container:

By supplying the container image at execution time through the runtime_env argument of the ray.init() method when connecting to the cluster via the Ray client, the tasks should be provided with the correct execution environment, with the required dependencies preinstalled.

This was done by supplying the runtime_env argument with a dictionary containing the following definitions:

runtime_env = {
    "eager_install": True,
    "container": {
        "image": "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4",
        # "run_options": ["--cap-drop SYS_ADMIN", "--log-level=debug", "--privileged"]
    }
}
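
For reference, this is how the dictionary above is passed when connecting with the Ray client (a minimal sketch; the address is a placeholder for our head node's client endpoint, and the task is only there to check which environment it runs in):

import ray

# Connect with the runtime_env defined above; all tasks and actors started
# from this client should then run inside the specified container image.
ray.init(address="ray://<head-node-ip>:10001", runtime_env=runtime_env)

@ray.remote
def check_environment():
    import socket
    return socket.gethostname()  # expected to report the container's hostname

print(ray.get(check_environment.remote()))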

Both of these approaches failed, for different reasons.

Help or pointers

I would be very glad if people could review this and tell me:

  1. Am I setting up the cluster correctly in the first place to achieve the desired outcome?

  2. For example, should the cluster itself be run in Docker containers if containers are to be supplied via the runtime_env argument?
    Are there additional run arguments required for the runtime_env Docker container that need to be added to get it running correctly?

  3. What is the way around the apparent head node dependency issue when deserializing input objects for tasks?
    Is this purely an outcome of using the Ray client approach instead of submitting the entire job to the server using ray submit (see the sketch below for what I mean)?
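
To clarify what I mean by the alternative in question 3, here is a sketch of submitting the whole driver to the cluster with the Ray Jobs API instead of driving it over the Ray client (the dashboard address, script name, and runtime_env contents are placeholders, not something I have actually run):

from ray.job_submission import JobSubmissionClient

# Assumes the Jobs server is reachable on the head node's dashboard port (8265).
client = JobSubmissionClient("http://<head-node-ip>:8265")

job_id = client.submit_job(
    entrypoint="python my_driver.py",   # hypothetical driver script
    runtime_env={"working_dir": "."},   # ship the local project to the cluster
)
print(job_id)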

This is my full cluster configuration file for deployment on Azure.

# A unique identifier for the head node and workers of this cluster.
cluster_name: ecm-ray-v2-dev

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 4

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "our_private_ACR.azurecr.io/ray-ml-2.0.0dev-cpu:0.1.1" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: false
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
        - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-cpu"
    # Allow Ray to automatically detect GPUs

    worker_image: "our_private_ACR.azurecr.io/ray-ml-2.0.0dev-cpu:0.1.1"
    worker_run_options: # []
        - --ulimit nofile=65536:65536
        - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: westeurope
    resource_group: data-science-ray-clusters-rg
    # set subscription id otherwise the default from az cli will be used
    subscription_id: 3da03240-09fb-4d1e-8631-c507fcd3dd9b
    tags: {'owner' : 'my_email'}
    # When stopping worker nodes, don't delete them but just deallocate them.
    cache_stopped_nodes: true

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: azureuser
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/ray_cluster_bayesian_id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/ray_cluster_bayesian_id_rsa.pub

# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 0}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D16s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

    ray.worker.default_bayes:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 4
        # The resources provided by this node type.
        resources: {"CPU": 16, "NODETYPE_DEFAULT_BAYES": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D16s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
        docker:
            worker_image: "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.6"
            pull_before_run: False
            worker_run_options:
                - --ulimit nofile=65536:65536
                - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].

    ray.worker.default_itc:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 4
        # The resources provided by this node type.
        resources: {"CPU": 16, "NODETYPE_DEFAULT_ITC": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D16s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
        docker:
            worker_image: "our_private_ACR.azurecr.io/itctools-ray-2.0.0dev:0.1.0"
            pull_before_run: False
            worker_run_options:
                - --ulimit nofile=65536:65536
                - --privileged # This is for enabling docker inside a docker execution [i.e. running a docker container from runtime_env].

    ray.worker.small_cpu:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 8, "small_cpu": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D8s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
    
    ray.worker.large_cpu:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 32, "large_cpu": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D32s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

    ray.worker.vlarge_cpu:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 64, "vlarge_cpu": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D64s_v5
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
     "~/.ssh/ray_cluster_bayesian_id_rsa.pub" : "~/.ssh/ray_cluster_bayesian_id_rsa.pub"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful
    # Bespoke commands for setting up access to the Azure Container Registry.
    # The node already has the managed identity associated with it, which has access privileges to the ACR, so login is simple.
    - |
        az login --identity
        az acr login --name our_private_ACR
    
   
# List of shell commands to run to set up nodes.
# Custom commands that will be run on all nodes after common setup [this is inside the docker container if using docker images].
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"


# Custom commands that will be run on the head node after common setup [this is inside the docker container if using docker images].
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands: # []
    # - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0
    # Get ACR access inside the docker container for PODMAN access.
    # First need to start docker service in the head node docker container.
    - sudo service docker start
    - sleep 20
    # Second, get credentials using the Managed Identity attached to the head node.
    - |
        az login --identity
        az acr login --name our_private_ACR

# Custom commands that will be run on worker nodes after common setup [this is inside the docker container if using docker images].
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
    

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}