Serving custom-built containers hanging on deployment

1. Severity of the issue:
Medium: Affects my productivity but can find a workaround.

To start, my objective is a multi-app container setup like the one in the docs (Run Multiple Applications in Different Containers, Ray 2.47.1). We don’t have access to the public Ray image and have to build our own images for everything, so I’m using Oracle Linux (which I believe is Fedora-based). I’ve made a baseline Ray image (installed ray[serve-grpc]), and the dashboard comes up and seems healthy. However, once the cluster is up and starts deploying applications, they hang in the deploying state forever.

In my RayService YAML, the container spec for both the head and the worker looks something like this:

spec:
  containers:
    - name: ray-worker # (or ray-head)
      image: {custom base ray image}:latest

On the application side, the application images are not built from my base Ray image. Instead, they run the exact same Docker steps from the same underlying base image (an artifact of our build pipeline) and, after installing the same Ray version, add some custom dependencies. They are extremely lightweight Serve applications that use the FastAPI integration and are similar to the example in the docs. The only notable difference from the docs is that my application Dockerfile does not use /serve_app as the container’s working directory like the docs do; I set it to /app instead, which didn’t seem to matter (yet). The Serve applications also work outside of Docker containers on a local cluster.

In my serveConfigV2, I followed the app_builder pattern: each builder takes a Pydantic model whose arguments are all optional and returns an Application. Each application also has a deployment section that asks for a single replica, and the cluster definitely has capacity. I should mention that I’m running this test with the default KubeRay operator 1.3.0, not my custom one.

applications:
  - name: cool_app
    import_path: cool_app:app_builder
    route_prefix: /cool_app
    runtime_env:
      image_uri: {Custom app image 1}
  - name: hello_world
    import_path: hello_world:app_builder
    route_prefix: /hello
    runtime_env:
      image_uri: {Custom app image 2}
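
To rule out a config mistake on my side, a small stdlib-only sanity check over the applications list can be sketched as below. This is only a sketch: the dict mirrors the YAML above by hand (the image strings are the same placeholders), and `config_problems` is a hypothetical helper of mine, not part of Ray. It only looks for duplicate names or route prefixes and for a missing `image_uri`.

```python
from collections import Counter

# Hand-written mirror of the applications list above.
# The image_uri values are placeholders, exactly as in the YAML.
SERVE_CONFIG = {
    "applications": [
        {
            "name": "cool_app",
            "import_path": "cool_app:app_builder",
            "route_prefix": "/cool_app",
            "runtime_env": {"image_uri": "{Custom app image 1}"},
        },
        {
            "name": "hello_world",
            "import_path": "hello_world:app_builder",
            "route_prefix": "/hello",
            "runtime_env": {"image_uri": "{Custom app image 2}"},
        },
    ]
}


def config_problems(config):
    """Return a list of human-readable problems found in the config dict."""
    problems = []
    apps = config.get("applications", [])

    # Application names and route prefixes should not repeat across apps.
    for field in ("name", "route_prefix"):
        counts = Counter(app.get(field) for app in apps)
        for value, count in counts.items():
            if value is not None and count > 1:
                problems.append(f"duplicate {field}: {value!r}")

    # Every app in this setup is expected to run in its own image.
    for app in apps:
        if not app.get("runtime_env", {}).get("image_uri"):
            problems.append(
                f"application {app.get('name')!r} has no runtime_env.image_uri"
            )
    return problems


if __name__ == "__main__":
    for problem in config_problems(SERVE_CONFIG):
        print(problem)
```

An empty result only says the config is internally consistent; it says nothing about whether the images themselves can start.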

I don’t get any logs after the configuring application line on the controller, and the deployment logs are empty on the dashboard. I had to install a number of Fedora dependencies in my custom image to get to this point (podman, passt, and fuse). I know that images can be pulled (the baseline Ray image that’s pulled when the cluster is created is in the same CR). I’ve even tried the baseline FastAPI + Hello World code from the docs with the same result, which seems to suggest it has something to do with my custom image.

However, I also tried standing up a cluster that uses the standard rayproject/ray:2.47.1 image for the nodes in my YAML, with only the application containers custom-built, but got errors about podman not being installed. So I’m not sure whether I need to extend the Ray base image to be able to pull my containers.
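
Since the podman error suggests the node image itself needs the container tooling, a quick check that could be run inside the head and worker containers is sketched below. The executable names in `REQUIRED_TOOLS` are my assumption about what the podman, passt, and fuse packages put on PATH (they may differ per distro and package version), and `missing_tools` is a hypothetical helper, not a Ray API.

```python
import shutil

# Assumed executable names for the packages mentioned above;
# adjust them to whatever your image actually installs.
REQUIRED_TOOLS = ("podman", "passt", "fusermount")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing on PATH:", ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```

Finding the binaries on PATH does not prove podman can actually launch a container as the Ray user, so this only narrows down the "not installed" class of failures.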

2. Environment:

  • Ray version: 2.47.1
  • Python version: 3.12
  • OS: Oracle Linux (x86)
  • Cloud/Infrastructure: OCI
  • Other libs/tools (if relevant):