How do I correctly build a Ray Serve server in Docker from a generic Ubuntu image (x86_64 on an AMD system)?

I have been trying to build a Docker container for a Ray Serve server using the following Dockerfile:

FROM ubuntu:latest

# Install Python and other necessary packages
RUN apt-get update && \
    apt-get install -y python3-pip python3-dev build-essential

# Install Python dependencies from requirements.txt
COPY requirements.txt /app/requirements.txt
RUN pip3 install --upgrade pip
RUN pip3 install -r /app/requirements.txt

# Copy the application code
COPY ./app /app
# COPY ./models /models

WORKDIR /app

# Expose the Ray dashboard port (8265) and the Serve HTTP port (8001)
EXPOSE 8265
EXPOSE 8001

# Command to run the Ray Serve app
CMD ["python3", "main.py"]

The main.py is below:

import ray
from ray import serve
from fastapi import FastAPI
from starlette.requests import Request
from starlette.responses import JSONResponse
from llama_index.llms.llama_cpp import LlamaCPP
import subprocess, uvicorn

subprocess.run(["ray", "start", "--head", "--node-ip-address", "0.0.0.0", "--port", "8001"])

app = FastAPI()

# Initialize Ray
ray.init(address="auto", namespace="llama")
serve.shutdown() 
serve.start(detached=True)

@serve.deployment()
@serve.ingress(app)
class LlamaModelDeployment:
    def __init__(self, model_path):
        # Load the model (adjust the model path as necessary)
        self.model = LlamaCPP(
            model_path=model_path,
            temperature=0.2,
            max_new_tokens=1024,
            context_window=2048,
            model_kwargs={"n_gpu_layers": 0},
            verbose=True,
        )

    @app.get("/hello")
    async def root(self):
        return JSONResponse({"response": "Hello World"})

    @app.post("/llama")
    async def llama(self, request: Request):
        data = await request.json()
        # Process the request and generate a response using the LlamaModel instance
        input_text = data.get("input", "")
        output = self.model.complete(input_text)
        return JSONResponse({"response": output.dict()})

if __name__ == "__main__":
    # Deploy the model
    model_path = "/models/phi-2.Q2_K.gguf"
    serve.run(LlamaModelDeployment.bind(model_path), route_prefix="/")
    
    # test server
    import requests
    resp = requests.get("http://0.0.0.0:8001/hello")
    print(resp)

    # don't let ray docker sleep
    import time
    while True:
        time.sleep(10)

The Docker Compose file is:

services:
  llamaserve:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: llamaserve_app
    volumes:
      - /path/to/models:/models
    ports:
      - "8265:8265"
      - "8001:8001"

I keep getting 'Connection aborted.' / BadStatusLine errors. Running main.py outside the Docker container works fine. So far I have tried different base images (including the ones provided by Ray), changing Docker configurations, and reverting to plain Ray Serve instead of the FastAPI ingress, but nothing has worked.

Any help will be appreciated!

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

After some more tinkering I found that you have to set http_options: serve.start(detached=True, http_options={"host": "0.0.0.0", "port": 8001}). (Note: change or remove the --port flag on ray start.) Now it works!
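
For anyone hitting the same thing, here is a minimal sketch of the adjusted startup section of main.py (same layout as the original post; the only new piece is the http_options argument, and --port is dropped from ray start):

import subprocess
import ray
from ray import serve

# Start the Ray head node; no --port flag needed any more.
subprocess.run(["ray", "start", "--head"])

# Connect this process to the running Ray instance.
ray.init(address="auto", namespace="llama")

# Bind Serve's HTTP proxy to all interfaces so it is reachable from outside
# the container, on the port mapped in docker-compose (8001).
serve.start(detached=True, http_options={"host": "0.0.0.0", "port": 8001})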

@Sayan_Mandal any suggestions for improving our Ray docs so this is more obvious when onboarding to Ray?

I suspect this might be a minor bug in the Ray Serve implementation (or it could be intentional on the authors' part). By default Serve listens on port 8000, which works fine in a local environment, but inside the Docker container it isn't obvious which interface and port are actually being served. The Serve HTTP port needs to be set through serve.start(http_options). I verified this by explicitly setting the same port with ray start --port <port> and serve.start(http_options={"port": <port>}), which results in a "port not available" error.

This didn’t work for me unfortunately. I kept getting errors like ray had already been started, and then trying to fix that, unexpected shutdown errors.

@Sam_Chan Hey, I found it really difficult to figure out how to get Ray Serve to listen on 0.0.0.0, which is required for Docker Compose configurations.

The documentation gives you an example using only serve run, but serve run doesn't (as far as I can tell) allow you to change the host or port, so it will only work in a single-container setup.

serve run will happily run without ray start; it will init Ray on its own.

But if I want to configure the host and port, I need serve start, which will not run without ray start: you have to init Ray manually. This inconsistency is not very user friendly.
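
To make the two paths concrete, here is a rough Python-side sketch (the Hello deployment is made up for illustration; it just mirrors the CLI behaviour described above):

from ray import serve

@serve.deployment()
class Hello:
    async def __call__(self, request):
        return "hello"

app = Hello.bind()

# Path 1: serve.run(app) on its own (like `serve run`) starts Ray for you,
# but you don't get to choose the HTTP host/port.
#
# Path 2: configure the proxy first, then run the app. From the CLI this is
# the `ray start --head` + `serve start --http-host 0.0.0.0` + `serve run`
# sequence shown further down.
serve.start(detached=True, http_options={"host": "0.0.0.0", "port": 8000})
serve.run(app)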

Running serve start without ray start gives the error:
“ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting --address flag or RAY_ADDRESS environment variable.”

But it doesn’t tell you that you can run ray start instead of setting an --address (and it’s not clear from the CLI docs what ‘address’ it’s referring to or how to obtain such an address)

The API/CLI commands should probably have some examples and a clearer explanation of what the options do.

If anyone wants to know how I got my server to run, here’s what I needed to do:

‘Dockerfile’

COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT [ "bash", "/entrypoint.sh" ]

‘entrypoint.sh’

#!/bin/bash
# Start the Ray head node first (serve start refuses to run without one)
ray start --head
# Bring up the Serve HTTP proxy on all interfaces so other containers can reach it
serve start --http-host 0.0.0.0 --http-port 8000
# Deploy the application; this command blocks and keeps the container alive
serve run app.main:api

‘app/main.py’

from ray import serve

@serve.deployment()
class YourAPI:
    ...

# Top-level bound app that `serve run app.main:api` points at
api = YourAPI.bind()
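
A small smoke test from the host once the container is up (this assumes YourAPI answers GET requests at the default route prefix /, and that docker-compose maps the Serve port from the entrypoint, e.g. "8000:8000"):

import requests

# Hit the Serve HTTP proxy started by `serve start --http-host 0.0.0.0 --http-port 8000`.
resp = requests.get("http://localhost:8000/")
print(resp.status_code, resp.text)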