Hi team,
I'm trying to use Ray Serve's multi-application config to deploy an application to my Kubernetes cluster, with a presigned AWS S3 URL as my working_dir. It works locally, but when I deploy to Kubernetes I get: ModuleNotFoundError: No module named 'deployments'
Here is my setup:
Directory:
application_root_dir
- deployments.py (contains all the Ray Serve deployments)
- config.yaml (Ray Serve deployment config)
...
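For context, deployments.py looks roughly like this (a heavily simplified sketch; the class body is a stand-in for my actual model code):

from ray import serve

@serve.deployment(name="model_name")
class Model:
    async def __call__(self, request):
        # Stand-in for the real inference logic.
        return "ok"

# Bound application that import_path "deployments:model_name" resolves to.
model_name = Model.bind()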
config.yaml:
# This file was generated using the `serve build` command on Ray v2.4.0.
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: 8000
applications:
- name: app1
  route_prefix: /path
  import_path: deployments:model_name
  runtime_env: {}
  deployments:
  - name: model_name
    autoscaling_config:
      min_replicas: 1
      initial_replicas: 2
      max_replicas: 5
      target_num_ongoing_requests_per_replica: 10.0
      metrics_interval_s: 10.0
      look_back_period_s: 30.0
      smoothing_factor: 1.0
      downscale_delay_s: 600.0
      upscale_delay_s: 30.0
    ray_actor_options:
      runtime_env:
        pip:
          ...
        working_dir: {aws s3 presigned url}
      num_cpus: 1.0
Kubernetes config:
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-sample
spec:
  rayVersion: '2.4.0' # should match the Ray version in the image of the containers
  ######################headGroupSpecs#################################
  # head group template and specs (perhaps 'group' is not needed in the name)
  enableInTreeAutoscaling: true
  autoscalerOptions:
    # upscalingMode is "Default" or "Aggressive."
    # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
    # Default: Upscaling is not rate-limited.
    # Aggressive: An alias for Default; upscaling is not rate-limited.
    upscalingMode: Default
    # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
    idleTimeoutSeconds: 60
    # image optionally overrides the autoscaler's container image.
    # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
    # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
    ## image: "my-repo/my-custom-autoscaler-image:tag"
    # imagePullPolicy optionally overrides the autoscaler container's default image pull policy (IfNotPresent).
    imagePullPolicy: IfNotPresent
    # Optionally specify the autoscaler container's securityContext.
    securityContext: {}
    env: []
    envFrom: []
    # resources specifies optional resource request and limit overrides for the autoscaler container.
    # The default autoscaler resource limits and requests should be sufficient for production use-cases.
    # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  headGroupSpec:
    # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
    serviceType: ClusterIP
    # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
    replicas: 1
    # logical group name, for this called head-group, also can be functional
    # pod type head or worker
    # rayNodeType: head # Not needed since it is under the headGroupSpec
    # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
    rayStartParams:
      port: '6379'
      #include_webui: 'true'
      object-store-memory: '100000000'
      # webui_host: "10.1.2.60"
      dashboard-host: '0.0.0.0'
      memory: '2147483648'
      node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
      block: 'true'
    # pod template
    template:
      metadata:
        labels:
          # custom labels. NOTE: do not define custom labels that start with `raycluster.`; they may be used by the controller.
          # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
          rayCluster: raycluster-sample # will be injected if missing
          rayNodeType: head # will be injected if missing, must be head or worker
          groupName: headgroup # will be injected if missing
        # annotations for pod
        annotations:
          key: value
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.4.0-gpu
          imagePullPolicy: Always
          #image: bonsaidev.azurecr.io/bonsai/lazer-0-9-0-cpu:dev
          env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          resources:
            limits:
              cpu: 1
              memory: 10Gi
            requests:
              cpu: 1
              memory: 10Gi
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          - containerPort: 52365
            name: dashboard-agent
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 1
    minReplicas: 1
    maxReplicas: 5
    # logical group name, for this called small-group, also can be functional
    groupName: small-group
    # if worker pods need to be added, we can simply increment the replicas
    # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
    # the operator will remove pods from the list until the number of replicas is satisfied
    # when a pod is confirmed to be deleted, its name will be removed from the list below
    #scaleStrategy:
    #  workersToDelete:
    #  - raycluster-complete-worker-small-group-bdtwh
    #  - raycluster-complete-worker-small-group-hv457
    #  - raycluster-complete-worker-small-group-k8tj7
    # the following params are used to complete the ray start: ray start --block --node-ip-address= ...
    rayStartParams:
      block: 'true'
      node-ip-address: $MY_POD_IP
    # pod template
    template:
      metadata:
        labels:
          key: value
        # annotations for pod
        annotations:
          key: value
      spec:
        initContainers:
        # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
        - name: init-myservice
          image: busybox:1.28
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
        containers:
        - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
          image: rayproject/ray-ml:2.4.0-gpu
          imagePullPolicy: Always
          # environment variables to set in the container. Optional.
          # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
          env:
          - name: RAY_DISABLE_DOCKER_CPU_WARNING
            value: "1"
          - name: TYPE
            value: "worker"
          - name: CPU_REQUEST
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: requests.cpu
          - name: CPU_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: limits.cpu
          - name: MEMORY_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: limits.memory
          - name: MEMORY_REQUESTS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: requests.memory
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          ports:
          - containerPort: 80
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: 1
              memory: 5Gi
            requests:
              cpu: 1
              memory: 5Gi
I have tested it locally and it works. Here are my testing steps, run from inside application_root_dir:
- serve build --multi-app deployments:model_name -o config.yaml
- ray start --head
- serve deploy config.yaml
- serve status:
name: app1
app_status:
  status: RUNNING
  message: ''
  deployment_timestamp: 1683592551.1961145
deployment_statuses:
- name: app1
  status: HEALTHY
  message: ''
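For reference, the app also answers locally at its route prefix (route_prefix /path, HTTP port 8000 in the config above):

curl http://localhost:8000/path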
However, when I deploy to my remote Kubernetes Ray cluster, I run:
serve deploy config.yaml --address {remote cluster dashboard agent address}
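(Concretely, I reach the dashboard agent by port-forwarding its port, 52365, from the head pod and pointing serve deploy at localhost; the pod name below is illustrative:

kubectl port-forward raycluster-sample-head-xxxxx 52365:52365
serve deploy config.yaml --address http://localhost:52365)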
Then, when I check serve status, I see this error:
Deploying app 'app1' failed:
ray::deploy_serve_application() (pid=5108, ip=172.31.107.146)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/controller.py", line 938, in deploy_serve_application
    app = build(import_attr(import_path), name)
  File "/home/ray/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'deployments'
Any thoughts on why this could be happening, or any workarounds? I'd really appreciate your help.