How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Background
I am investigating the use of Ray for two parallel-compute solutions, one for production and a separate one for R&D, both hosted on the Azure cloud.
A number of issues have been challenging me, the most pressing being a current limitation on Azure: only a single cluster can be stood up per subscription. I believe a fix may be on the way to allow running multiple clusters in one resource group (see "[autoscaler] Enable creating multiple clusters in one resource group" by HongW2019, Pull Request #22997, ray-project/ray on GitHub).
To investigate the flexibility of the Ray system, I have been looking into how dependencies are handled when executing remote tasks, and decided that Docker containers are the most controllable and desirable approach, especially as we have private package dependencies. I have been building Docker containers for executing code on the remote, Azure-hosted cluster using the Ray API and the configuration YAML files.
Desired outcome
To be able to execute different problems with different dependencies (accounted for with different Docker containers) on a single cluster, such that the task dependencies are supplied at runtime along with the task execution.
This would be done either by provisioning specific node types associated with particular Docker containers, or by specifying the requirements through the runtime_env argument of the ray.init() method.
Attempted solutions so far
To test these scenarios as potential solutions to the desired outcome, the cluster was set up on Azure using the configuration YAML shown below (see the Ray cluster YAML config for Azure, provided in full for reference) and the ray up CLI command.
Looking at the YAML config, some key features of the deployment are as follows:
- Docker container deployment - the head node and worker nodes are deployed as Docker containers with a basic install of the Ray runtime, plus some dependencies required for working with the runtime_env Docker deployment (e.g. podman, and the Azure CLI for authenticating to a private repository).
- Custom node types - some defined with different instance sizes, and some with specific Docker specifications that supersede the default.
Currently there are two ways that I have tried to achieve this, and neither is working:
Approach 1 - Node type specification:
By specifying a node_type configuration and using the per-node docker section to associate a particular Docker container with the node type through a custom resource key, a specific task should run on a Docker container that already has the dependencies installed.
An example of such a definition follows, where the node type ray.worker.default_bayes has a custom resource key NODETYPE_DEFAULT_BAYES, and the docker fields in the node definition specify a different worker image than the cluster base:
```yaml
ray.worker.default_bayes:
    # The minimum number of worker nodes of this type to launch.
    # This number should be >= 0.
    min_workers: 0
    # The maximum number of worker nodes of this type to launch.
    # This takes precedence over min_workers.
    max_workers: 4
    # The resources provided by this node type.
    resources: {"CPU": 16, "NODETYPE_DEFAULT_BAYES": 16}
    # Provider-specific config, e.g. instance type.
    node_config:
        azure_arm_parameters:
            vmSize: Standard_D16s_v5
            # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
            imagePublisher: microsoft-dsvm
            imageOffer: ubuntu-1804
            imageSku: 1804-gen2
            imageVersion: latest
            # optionally set priority to use Spot instances
            priority: Spot
            # set a maximum price for spot instances if desired
            # billingProfile:
            #     maxPrice: -1
    docker:
        worker_image: "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.6"
        pull_before_run: False
        worker_run_options:
            - --ulimit nofile=65536:65536
            - --privileged # This enables docker-in-docker execution (i.e. running a docker container from runtime_env).
```
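If this node type worked as intended, a task would be pinned to it by requesting the custom resource key at task-definition time. The sketch below shows the options dictionary I would expect to use; fit_bayes_model, the CPU count, and the head-node address are hypothetical, and the Ray calls are commented out because they require a live cluster:

```python
# Hypothetical sketch: route a task onto a ray.worker.default_bayes node by
# requesting the custom resource key declared in the YAML above. Requesting
# 1 unit of NODETYPE_DEFAULT_BAYES (out of the 16 the node advertises) is
# enough to constrain scheduling to that node type.
remote_options = {
    "num_cpus": 4,  # hypothetical CPU request for the task
    "resources": {"NODETYPE_DEFAULT_BAYES": 1},
}

# With a live cluster, the options would be applied via the decorator:
# import ray
# ray.init("ray://<head-node-ip>:10001")  # <head-node-ip> is a placeholder
#
# @ray.remote(**remote_options)
# def fit_bayes_model(data):  # hypothetical task name and body
#     ...
#
# result = ray.get(fit_bayes_model.remote(data))
```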
Approach 2 - runtime_env Docker container:
By supplying the container image at execution time through the runtime_env argument of the ray.init() method, during the Ray client connection to the cluster, the correct execution environment with the required preinstalled dependencies should be provided.
This was attempted by supplying the runtime_env argument with a dictionary containing the following definitions:
```python
runtime_env = {
    "eager_install": True,
    "container": {
        "image": "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4",
        # "run_options": ["--cap-drop SYS_ADMIN", "--log-level=debug", "--privileged"]
    }
}
```
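For reference, this dictionary is passed in when connecting via the Ray client. A self-contained sketch follows (it repeats the dictionary so it stands alone; the head-node address is a placeholder, and the ray.init() call is commented out because it needs a live cluster):

```python
# Sketch of how the runtime_env dictionary is supplied at connection time.
runtime_env = {
    "eager_install": True,
    "container": {
        "image": "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4",
    },
}

# import ray
# ray.init("ray://<head-node-ip>:10001", runtime_env=runtime_env)
# The expectation is that every worker process for this job then starts
# inside the given container image.
```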
Both of these approaches failed, for different reasons.
Help or pointers
I would be very glad if people could review this and tell me:
- Am I setting up the cluster correctly in the first place to achieve the desired outcome? For example, should the cluster itself be run in Docker containers if containers are to be supplied through the runtime_env argument?
- Are there additional run arguments required for the runtime_env Docker container that need to be added to get it running correctly?
- What is the way around the apparent head-node dependency issue when deserializing input objects for tasks? Is this purely an outcome of using the Ray client approach instead of submitting the entire job to the server using ray submit?
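On that last point, the submission alternative I have in mind would look roughly like the following (all paths and addresses are placeholders; option B assumes the Ray Jobs CLI that ships with Ray 2.x, and whether the container field of runtime_env behaves differently there is exactly what I am unsure about):

```shell
# Option A: the cluster launcher's submit command, which runs the script
# on the head node rather than driving it from a local Ray client:
ray submit cluster.yaml my_script.py

# Option B: the Ray Jobs CLI (Ray 2.x), which also accepts a runtime_env;
# the address points at the dashboard port on the head node:
ray job submit \
  --address http://<head-node-ip>:8265 \
  --runtime-env-json '{"container": {"image": "our_private_ACR.azurecr.io/smartreturntoolsbayes-ray-2.0.0dev-cpu:0.1.4"}}' \
  -- python my_script.py
```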