How do I correctly build a Ray Serve server in Docker with a generic Ubuntu image (x86_64 on an AMD system)?

I have been trying to build a Docker container for a Ray Serve server using the following Dockerfile:

FROM ubuntu:latest

# Install Python and other necessary packages
RUN apt-get update && \
    apt-get install -y python3-pip python3-dev build-essential

# Install Python dependencies from requirements.txt
COPY requirements.txt /app/requirements.txt
RUN pip3 install --upgrade pip
RUN pip3 install -r /app/requirements.txt

# Copy the application code
COPY ./app /app
# COPY ./models /models

WORKDIR /app

# Expose the Ray dashboard port and the Serve HTTP port
EXPOSE 8265
EXPOSE 8001

# Command to run the Ray Serve
CMD ["python3", "main.py"]

The main.py is below:

import ray
from ray import serve
from fastapi import FastAPI
from starlette.requests import Request
from starlette.responses import JSONResponse
from llama_index.llms.llama_cpp import LlamaCPP
import subprocess

subprocess.run(["ray", "start", "--head", "--node-ip-address", "0.0.0.0" ,"--port", "8001"])

app = FastAPI()

# Initialize Ray
ray.init(address="auto", namespace="llama")
serve.shutdown() 
serve.start(detached=True)

@serve.deployment()
@serve.ingress(app)
class LlamaModelDeployment:
    def __init__(self, model_path):
        # Load the model (adjust the model path as necessary)
        self.model = LlamaCPP(
            model_path=model_path,
            temperature=0.2,
            max_new_tokens=1024,
            context_window=2048,
            model_kwargs={"n_gpu_layers": 0},
            verbose=True,
        )

    @app.get("/hello")
    async def root(self):
        return JSONResponse({"response": "Hello World"})

    @app.post("/llama")
    async def root(self, request: Request):
        data = await request.json()
        # Process the request and generate a response using the LlamaModel instance
        input_text = data.get("input", "")
        output = self.model.complete(input_text)
        return JSONResponse({"response": output.dict()})

if __name__ == "__main__":
    # Deploy the model
    model_path = "/models/phi-2.Q2_K.gguf"
    serve.run(LlamaModelDeployment.bind(model_path), route_prefix="/")
    
    # test server
    import requests
    resp = requests.get("http://0.0.0.0:8001/hello")
    print(resp)

    # don't let ray docker sleep
    import time
    while True:
        time.sleep(10)

The Docker Compose file is:

services:
  llamaserve:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: llamaserve_app
    volumes:
      - /path/to/models:/models
    ports:
      - "8265:8265"
      - "8001:8001"

I keep getting ‘Connection aborted.’ / BadStatusLine errors. Running main.py outside the Docker container works fine. So far I have tried different base images (including the ones provided by Ray), changing the Docker configuration, and reverting to a plain Ray Serve deployment instead of the FastAPI ingress, but nothing has worked.

Any help will be appreciated!

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

After some more tinkering I found that you have to set http_options: serve.start(detached=True, http_options={"host": "0.0.0.0", "port": 8001}) (note: also change/remove the --port flag passed to ray start). Now it works!
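
For anyone who lands here, this is a minimal sketch of how the startup section of main.py looks after the fix; the deployment class and the rest of the file stay the same as in the original post, and the host/port values are the ones from my setup, so adjust as needed:

import subprocess

import ray
from ray import serve

# Start the head node inside the container; leave --port at its default so it
# doesn't collide with the port we want the Serve HTTP proxy to use.
subprocess.run(["ray", "start", "--head"], check=True)

# Connect to the local Ray cluster.
ray.init(address="auto", namespace="llama")

# Bind the Serve HTTP proxy to 0.0.0.0 so it is reachable from outside the
# container, and to 8001 to match the port published in docker-compose.
serve.start(detached=True, http_options={"host": "0.0.0.0", "port": 8001})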

@Sayan_Mandal, any suggestions for improving our Ray docs so this is more obvious when onboarding to Ray?

I suspect this might be a minor bug in the Ray Serve implementation (or it could be intentional). By default, the Serve HTTP proxy listens on port 8000, which works fine in a local environment, but inside the Docker container it is unclear which process is actually serving on the published port. The Serve HTTP port needs to be set via serve.start(http_options=...); the --port flag of "ray start" does not control it. I verified this by explicitly setting the same port in both "ray start --port <port>" and serve.start(http_options={"port": <port>}), which results in a "port not available" error.
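
As a quick sanity check (run from the host, assuming the container is up, the model has finished loading, and port 8001 is published as in the compose file above):

import requests

# The Serve HTTP proxy should now answer on the published port.
resp = requests.get("http://localhost:8001/hello")
print(resp.status_code, resp.json())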