Ray on AKS using Kubernetes Job with runtime_env working_dir throws error

I deployed Ray on AKS using the Ray Helm chart with a custom image based on rayproject/ray:1.9.2-py38 (all images used are based on this same base image for consistency).

I want to deploy a Ray Serve endpoint using a K8S Job (I need to use a Job rather than running code locally and connecting to Ray through port-forwarding because of organizational constraints). I need to include additional .py files containing helper functions that my main.py needs in order to run and deploy to Ray Serve, and this is where I am having difficulty. I am open to all suggestions. I have tried ray.init(address="ray://isc-ray-cluster-ray-head:10001", runtime_env={"working_dir": "."}), as noted in other topics here and in the docs, and this seems to make the most sense. However, I am encountering the following error message (note that my "main.py" is actually called deploy_iris.py, and yes, I am deploying that ol' faithful iris model :wink:):

Traceback (most recent call last):
  File "deploy_iris.py", line 30, in <module>
    ray.init(address="ray://isc-ray-cluster-ray-head:10001",
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 775, in init
    return builder.connect()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/client_builder.py", line 151, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client_connect.py", line 33, in connect
    conn = ray.connect(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/__init__.py", line 228, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/__init__.py", line 88, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/worker.py", line 683, in _server_init
    runtime_env = upload_working_dir_if_needed(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/working_dir.py", line 44, in upload_working_dir_if_needed
    working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/packaging.py", line 322, in get_uri_for_directory
    hash_val = _hash_directory(directory, directory,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/packaging.py", line 119, in _hash_directory
    _dir_travel(root, excludes, handler, logger=logger)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/packaging.py", line 86, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/packaging.py", line 86, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/packaging.py", line 86, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  [Previous line repeated 2 more times]
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/packaging.py", line 83, in _dir_travel
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/packaging.py", line 80, in _dir_travel
    handler(path)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/packaging.py", line 110, in handler
    with path.open("rb") as f:
  File "/home/ray/anaconda3/lib/python3.8/pathlib.py", line 1218, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/home/ray/anaconda3/lib/python3.8/pathlib.py", line 1074, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/anaconda3/pkgs/python-3.8.5-h7579374_1/compiler_compat/ld'

My K8S Job Dockerfile is as follows:

FROM XXXX.azurecr.io/isc-ray-base:latest 
COPY deploy_iris.py deploy_iris.py
COPY utils_auth.py utils_auth.py
CMD ["python", "deploy_iris.py"] 

where isc-ray-base is a custom image built from rayproject/ray:1.9.2-py38 with the additional packages we need.

P.S. Brute-forcing it by placing all my code in a single large main.py and removing the runtime_env from ray.init works as expected.

Happy to provide more details as needed! I'm not sure if this is a bug or if I am doing something incorrectly or missing some lines in my Dockerfile.

Thank you so much in advance!


If possible, can you share some more context from deploy_iris.py (or put together a small repro), and also check that '/home/ray/anaconda3/pkgs/python-3.8.5-h7579374_1/compiler_compat/ld' exists?

cc @architkulkarni: it looks like the error is happening in the runtime_env/packaging path. Any guesses here?

Thanks @ckw017 (and thanks in advance @architkulkarni) - see below for a simplified deploy_iris.py.

Let's assume that utils.py is a file in the working_dir that contains additional code. In this example it contains just one function, which decodes model results, and I import it in the inference method.

The example below also assumes that the model is defined in a text file called lightgbm_iris_model.txt. It would normally be pulled from a model registry, but here it also highlights the issue, since it can be stored in working_dir as well.

import ray
from ray import serve
from fastapi import FastAPI, Request
import numpy as np
import pandas as pd  
import lightgbm as lgb
runtime_env = {"working_dir": "."}
# Connect to the running Ray cluster on AKS
ray.init(address="ray://ray-cluster-ray-head:10001",
            namespace="serve", _metrics_export_port=8080,
            runtime_env = runtime_env
            ) ## on AKS
# Bind on 0.0.0.0 to expose the HTTP server on external IPs.
serve.start(detached=True, http_options={"host": "0.0.0.0"})
## Create FastAPI App
app = FastAPI()
@serve.deployment(route_prefix="/iris", num_replicas = 4, max_concurrent_queries = 100)
@serve.ingress(app)
class IrisModel:
    def __init__(self):        
        # Load the model file shipped in working_dir
        self.model = lgb.Booster(model_file='lightgbm_iris_model.txt')
        # Hard-coded label names; keys are the class indices 0-2 returned by argmax
        self.label_decoder = {0:'Versicolor',1:'Setosa',2:'Virginica'}
    @app.get("/")    
    async def inference(self, request: Request):      
        from utils import decode      
        data = await request.json()  
        estimates = self.model.predict(pd.DataFrame.from_dict(data))
        results = decode(np.argmax(estimates, axis = 1) , self.label_decoder)
        return results.tolist()
## Deploy Deployment
IrisModel.options(name="IrisModel-v1").deploy()

I also had a quick look at the head and worker nodes, and they both seem to have '/home/ray/anaconda3/pkgs/python-3.8.5-h7579374_1/compiler_compat/ld'. I also looked at the running AKS Job container, and it has it too, so I'm not sure why the error says it cannot be found.

Thanks again
Adam

Thanks so much for the details. This does look like it might be some sort of bug; I haven't seen it before. It looks like it's failing at the ray.init() line before it gets to any of the Ray Serve stuff, is that right?

The traceback is actually coming from your local machine, not from the cluster. @akelloway you mentioned /home/ray/anaconda3/pkgs/python-3.8.5-h7579374_1/compiler_compat/ld was present on all cluster nodes, but is it present on your local machine?

@architkulkarni - I think you are right that it is failing at the ray.init line.

In my particular situation I need to execute the deploy_iris.py code not from my local machine but from a docker container (the Dockerfile is in the original post) running as a Kubernetes Job - this is submitted to the AKS cluster through an Azure DevOps pipeline task - kubectl create -f job.yaml. I think this means my local machine is totally out of the loop, right? (honestly I am not 100% sure here but as best as I understand it) I checked the Kubernetes Job container and it also seems to have the /compiler_compat/ld stuff.

Thanks again for all the help.

Ah, my mistake, I see you mentioned that in the original post. Yes, you're right, your local machine is out of the picture, and what I thought of as the "local machine" is your Kubernetes Job container.

Looking at the raise e line in the traceback you posted, the line preceding it in _dir_travel logs an error: logger.error(f"Issue with path: {path}") (see packaging.py in ray-project/ray).

Can you search the logs for this line to see which path was giving the issue? The logs would be in /tmp/ray/session_latest/logs on the machine that called ray.init(). (I'm not sure which file; maybe you can search through the whole directory?)
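A minimal sketch of that search, assuming the logs are plain-text files somewhere under the log directory and that the message text is exactly "Issue with path:", as in the logger.error call quoted above. The function name find_packaging_errors is hypothetical and not part of Ray.

```python
from pathlib import Path


def find_packaging_errors(log_dir, needle="Issue with path:"):
    """Return (file, line_number, line) for every log line containing needle."""
    hits = []
    for log_file in sorted(Path(log_dir).rglob("*")):
        if not log_file.is_file():
            continue  # skip directories and anything that is not a regular file
        try:
            text = log_file.read_text(errors="replace")
        except OSError:
            continue  # unreadable file; move on to the next one
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(log_file), number, line.strip()))
    return hits


# Example call, run inside the container that called ray.init():
# for path, number, line in find_packaging_errors("/tmp/ray/session_latest/logs"):
#     print(f"{path}:{number}: {line}")
```

Searching every file recursively avoids having to guess which log file the message ends up in.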

What does the "." directory contain? I wonder if it has unrelated files which are somehow getting deleted while the working_dir is being packaged and uploaded to the cluster. Perhaps you can try putting your relevant files in a subdirectory with nothing else included, and setting working_dir just to that subdirectory.
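One way to answer that question is to walk the directory before passing it as working_dir. The sketch below is only a diagnostic aid: audit_working_dir is a hypothetical helper, not a Ray API. It also flags dangling symlinks, which are one possible reason a path can show up in a directory listing and still raise FileNotFoundError when opened, although nothing in this thread confirms that this is what happened here.

```python
import os


def audit_working_dir(root):
    """Summarize what packaging `root` would have to read.

    Returns (file_count, total_bytes, dangling_symlinks).
    """
    file_count = 0
    total_bytes = 0
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
            elif os.path.isfile(path):
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes, dangling


# Example call from the directory the job runs in:
# count, size, dangling = audit_working_dir(".")
# print(count, "files,", size, "bytes, dangling symlinks:", dangling)
```

A very large file count or byte total suggests that "." covers far more than the few code files that were meant to be uploaded.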

Thanks @architkulkarni, it looks like the solution was, as you said, to be specific about working_dir. For anyone else coming across this issue, what works for me right now is setting "working_dir": "/home/ray/src" (for example) and then adding the following to my Dockerfile:

RUN mkdir /home/ray/src
WORKDIR /home/ray/src

before copying over all the necessary code files.
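On the Python side, the matching change could look roughly like the sketch below. It assumes the code files were copied to /home/ray/src as described and reuses the cluster address from the original post; build_runtime_env and connect are hypothetical helper names, and the directory check is an extra safeguard rather than something Ray requires.

```python
import os


def build_runtime_env(src_dir="/home/ray/src"):
    """Build a runtime_env that packages only src_dir."""
    if not os.path.isdir(src_dir):
        raise FileNotFoundError(f"working_dir does not exist: {src_dir}")
    return {"working_dir": src_dir}


def connect(address="ray://isc-ray-cluster-ray-head:10001"):
    # Imported here so build_runtime_env can be used without Ray installed.
    import ray

    return ray.init(address=address, namespace="serve",
                    runtime_env=build_runtime_env())
```

Pointing working_dir at a directory that holds only the code files keeps the upload from walking unrelated parts of the image, such as the anaconda3 tree seen in the traceback.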

Thanks again @architkulkarni and @ckw017 for all the help.
