Build Custom Ray Docker image

1. Severity of the issue: (select one)
[x] High: Completely blocks me.

2. Environment:

  • Ray version: 2.45.0
  • Python version: 3.10
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure: GCP
  • Other libs/tools (if relevant): see the requirements.txt and constraints.txt below

3. What happened vs. what you expected:

  • Expected: When running a job on a Ray cluster started from the custom Ray Docker image, the head node should be able to scale out and distribute the workload.
  • Actual: When running a job on a Ray cluster with the custom Ray Docker image, the tasks on the head node run without issues, but the scale-out fails. Logs below:
$ ray submit /opt/ray/config-docker-no-data.yaml scaling.py --verbose
2025-06-07 19:54:41,730	INFO util.py:382 -- setting max workers for head node type to 0
Loaded cached provider configuration from /tmp/ray-config-b8ca643054c58868c55bb0feb64a63327283e4a0
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2025-06-07 19:54:41,900 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-06-07 19:54:41,901 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
Fetched IP: 10.230.230.145
Running `mkdir -p /tmp/ray_tmp_mount/no-data-scaling/~ && chown -R ubuntu /tmp/ray_tmp_mount/no-data-scaling/~`
Warning: Permanently added '10.230.230.145' (ED25519) to the list of known hosts.
Shared connection to 10.230.230.145 closed.
Running `rsync --rsh ssh -i /home/biswalc/.ssh/ray-autoscaler_gcp_us-central1_rsc-general-computing_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4f0be890e6/e96c00e37e/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore scaling.py ubuntu@10.230.230.145:/tmp/ray_tmp_mount/no-data-scaling/~/scaling.py`
sending incremental file list
scaling.py

sent 938 bytes  received 35 bytes  1,946.00 bytes/sec
total size is 1,687  speedup is 1.73
Running `docker inspect -f '{{.State.Running}}' ray_shuffling || true`
Shared connection to 10.230.230.145 closed.
Running `docker exec ray_shuffling printenv HOME`
Shared connection to 10.230.230.145 closed.
Running `docker exec -it  ray_shuffling /bin/bash -c 'mkdir -p /root'  && rsync -e 'docker exec -i' -avz /tmp/ray_tmp_mount/no-data-scaling/~/scaling.py ray_shuffling:/root/scaling.py`
sending incremental file list
scaling.py

sent 936 bytes  received 35 bytes  1,942.00 bytes/sec
total size is 1,687  speedup is 1.74
Shared connection to 10.230.230.145 closed.
`rsync`ed scaling.py (local) to ~/scaling.py (remote)
2025-06-07 19:54:47,507	INFO util.py:382 -- setting max workers for head node type to 0
Fetched IP: 10.230.230.145
Running `docker exec ray_shuffling printenv HOME`
Shared connection to 10.230.230.145 closed.
Running `docker exec -it  ray_shuffling /bin/bash -c 'bash --login -c -i '"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /root/scaling.py)'"'"'' `
2025-06-07 12:54:50,408	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.230.230.145:6379...
2025-06-07 12:54:50,422	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.230.230.145:8265
Initial cluster resources: {'CPU': 4.0, 'object_store_memory': 4625979801.0, 'node:__internal_head__': 1.0, 'node:10.239.230.145': 1.0, 'memory': 9251959604.0}
Requesting 100 CPU-intensive remote tasks...
(intense_cpu_task pid=471) Starting CPU task on 10.230.230.145
(autoscaler +1m11s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +1m11s) Adding 2 node(s) of type ray_worker_med.
(autoscaler +3m28s) Removing 1 nodes of type ray_worker_med (launch failed).
(autoscaler +4m49s) Adding 1 node(s) of type ray_worker_med.
(autoscaler +4m49s) Removing 1 nodes of type ray_worker_med (launch failed).

How can I find additional details on why the Ray scaling is failing? If I use the default image, rayproject/ray:latest-cpu, the scaling works fine.

But when I use my own image built from scratch, it doesn't scale. I want to find out what is missing in my Dockerfile.
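
From the docs, I believe the place to look is the autoscaler's monitor log on the head node; this is roughly what I plan to check (a sketch, assuming the default log location /tmp/ray/session_latest/logs inside the head container), but please correct me if there is a better source:

# Stream the autoscaler log through the cluster launcher:
ray monitor /opt/ray/config-docker-no-data.yaml

# Or pull the monitor logs directly from the head node:
ray exec /opt/ray/config-docker-no-data.yaml \
    "tail -n 200 /tmp/ray/session_latest/logs/monitor.*"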

Below is my Dockerfile:

FROM ubuntu:22.04

ARG PYTHON_VERSION="3.10"

ENV PYTHON_VERSION=${PYTHON_VERSION}
ENV AUTOSCALER="autoscaler"
ENV TZ="America/Los_Angeles"
ENV LC_ALL="C.UTF-8"
ENV LANG="C.UTF-8"
ENV DEBIAN_FRONTEND="noninteractive"

# System packages needed to build and run Ray and its Python dependencies.
RUN apt-get update && apt-get install -y \
        python3-distutils \
        python3-testresources \
        cmake \
        curl \
        g++ \
        gcc \
        git \
        gnupg \
        libffi-dev \
        libjemalloc-dev \
        netbase \
        openssh-client \
        parallel \
        pkg-config \
        rsync \
        screen \
        sudo \
        tmux \
        tzdata \
        unzip \
        wget \
        zip \
        zlib1g-dev

# Point the python3/python commands at the requested interpreter version.
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 0 && \
    update-alternatives --install /usr/bin/python python /usr/bin/python${PYTHON_VERSION} 0

# Sanity-check that the interpreter aliases resolve.
RUN python3 --version && \
    which python3 && \
    python --version && \
    which python

# Bootstrap pip; --user installs it under /root/.local.
RUN curl -o get-pip.py https://bootstrap.pypa.io/get-pip.py && \
    python3 get-pip.py --user

# Refresh indexes, then clean apt caches and lists to keep the image small.
RUN apt-get -y update && \
    apt-get clean autoclean && \
    apt-get autoremove -y --purge && \
    rm -rf /var/lib/apt/lists/*

# Copy requirements.txt and constraints.txt into the image root; only
# requirements.txt is applied below.
COPY *.txt .

RUN PATH="${HOME}/.local/bin:$PATH" \
        python3 -m pip install \
        -r requirements.txt
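
Note: constraints.txt is copied into the image but not applied anywhere yet; if I wanted the pins enforced, the install step would presumably read as below (a sketch of the intent, unrelated to the scaling failure as far as I can tell):

RUN PATH="${HOME}/.local/bin:$PATH" \
        python3 -m pip install \
        -c constraints.txt \
        -r requirements.txt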

requirements.txt (pip-compile output from requirements.in):

absl-py==2.3.0
    # via dm-tree
aiohappyeyeballs==2.6.1
    # via aiohttp
aiohttp==3.12.11
    # via
    #   -r requirements.in
    #   aiohttp-cors
aiohttp-cors==0.8.1
    # via -r requirements.in
aiorwlock==1.5.0
    # via -r requirements.in
aiosignal==1.3.2
    # via aiohttp
annotated-types==0.7.0
    # via pydantic
anyio==4.9.0
    # via
    #   starlette
    #   watchfiles
async-timeout==5.0.1
    # via aiohttp
attrs==25.3.0
    # via
    #   aiohttp
    #   dm-tree
    #   jsonschema
    #   referencing
cachetools==5.5.2
    # via google-auth
certifi==2025.4.26
    # via requests
cffi==1.17.1
    # via cryptography
charset-normalizer==3.4.2
    # via requests
click==8.2.1
    # via
    #   -r requirements.in
    #   ray
    #   typer
    #   uvicorn
cloudpickle==3.1.1
    # via gymnasium
colorful==0.5.6
    # via -r requirements.in
cryptography==45.0.3
    # via pyopenssl
cupy-cuda12x==13.4.1
    # via -r requirements.in
distlib==0.3.9
    # via virtualenv
dm-tree==0.1.9
    # via -r requirements.in
exceptiongroup==1.3.0
    # via anyio
farama-notifications==0.0.4
    # via gymnasium
fastapi==0.115.12
    # via -r requirements.in
fastrlock==0.8.3
    # via cupy-cuda12x
filelock==3.18.0
    # via
    #   -r requirements.in
    #   ray
    #   virtualenv
frozenlist==1.6.2
    # via
    #   aiohttp
    #   aiosignal
fsspec==2025.5.1
    # via -r requirements.in
google-api-core==2.25.0
    # via
    #   google-api-python-client
    #   opencensus
google-api-python-client==2.171.0
    # via -r requirements.in
google-auth==2.40.3
    # via
    #   google-api-core
    #   google-api-python-client
    #   google-auth-httplib2
google-auth-httplib2==0.2.0
    # via google-api-python-client
googleapis-common-protos==1.70.0
    # via
    #   google-api-core
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
grpcio==1.73.0
    # via
    #   -r requirements.in
    #   opentelemetry-exporter-otlp-proto-grpc
gymnasium==1.0.0
    # via -r requirements.in
h11==0.16.0
    # via uvicorn
httplib2==0.22.0
    # via
    #   google-api-python-client
    #   google-auth-httplib2
idna==3.10
    # via
    #   anyio
    #   requests
    #   yarl
imageio==2.37.0
    # via scikit-image
importlib-metadata==8.7.0
    # via opentelemetry-api
jinja2==3.1.6
    # via memray
jsonschema==4.24.0
    # via
    #   -r requirements.in
    #   ray
jsonschema-specifications==2025.4.1
    # via jsonschema
lazy-loader==0.4
    # via scikit-image
linkify-it-py==2.0.3
    # via markdown-it-py
lz4==4.4.4
    # via -r requirements.in
markdown-it-py[linkify,plugins]==3.0.0
    # via
    #   mdit-py-plugins
    #   rich
    #   textual
markupsafe==3.0.2
    # via jinja2
mdit-py-plugins==0.4.2
    # via markdown-it-py
mdurl==0.1.2
    # via markdown-it-py
memray==1.17.2
    # via -r requirements.in
msgpack==1.1.0
    # via
    #   -r requirements.in
    #   ray
multidict==6.4.4
    # via
    #   aiohttp
    #   yarl
networkx==3.4.2
    # via scikit-image
numpy==2.2.6
    # via
    #   -r requirements.in
    #   cupy-cuda12x
    #   dm-tree
    #   gymnasium
    #   imageio
    #   pandas
    #   scikit-image
    #   scipy
    #   tensorboardx
    #   tifffile
opencensus==0.11.4
    # via -r requirements.in
opencensus-context==0.1.3
    # via opencensus
opentelemetry-api==1.34.0
    # via
    #   -r requirements.in
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
    #   opentelemetry-sdk
    #   opentelemetry-semantic-conventions
opentelemetry-exporter-otlp==1.34.0
    # via -r requirements.in
opentelemetry-exporter-otlp-proto-common==1.34.0
    # via
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
opentelemetry-exporter-otlp-proto-grpc==1.34.0
    # via opentelemetry-exporter-otlp
opentelemetry-exporter-otlp-proto-http==1.34.0
    # via opentelemetry-exporter-otlp
opentelemetry-proto==1.34.0
    # via
    #   opentelemetry-exporter-otlp-proto-common
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
opentelemetry-sdk==1.34.0
    # via
    #   -r requirements.in
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
opentelemetry-semantic-conventions==0.55b0
    # via opentelemetry-sdk
packaging==25.0
    # via
    #   -r requirements.in
    #   lazy-loader
    #   ray
    #   scikit-image
    #   tensorboardx
pandas==2.3.0
    # via -r requirements.in
pillow==11.2.1
    # via
    #   imageio
    #   scikit-image
platformdirs==4.3.8
    # via
    #   textual
    #   virtualenv
prometheus-client==0.22.1
    # via -r requirements.in
propcache==0.3.1
    # via
    #   aiohttp
    #   yarl
proto-plus==1.26.1
    # via google-api-core
protobuf==5.29.5
    # via
    #   -r requirements.in
    #   google-api-core
    #   googleapis-common-protos
    #   opentelemetry-proto
    #   proto-plus
    #   ray
    #   tensorboardx
py-spy==0.4.0
    # via -r requirements.in
pyarrow==20.0.0
    # via -r requirements.in
pyasn1==0.6.1
    # via
    #   pyasn1-modules
    #   rsa
pyasn1-modules==0.4.2
    # via google-auth
pycparser==2.22
    # via cffi
pydantic==2.11.5
    # via
    #   -r requirements.in
    #   fastapi
pydantic-core==2.33.2
    # via pydantic
pygments==2.19.1
    # via rich
pyopenssl==25.1.0
    # via -r requirements.in
pyparsing==3.2.3
    # via httplib2
python-dateutil==2.9.0.post0
    # via pandas
pytz==2025.2
    # via pandas
pyyaml==6.0.2
    # via
    #   -r requirements.in
    #   ray
ray==2.45.0
    # via -r requirements.in
referencing==0.36.2
    # via
    #   jsonschema
    #   jsonschema-specifications
requests==2.32.4
    # via
    #   -r requirements.in
    #   google-api-core
    #   opentelemetry-exporter-otlp-proto-http
    #   ray
rich==14.0.0
    # via
    #   -r requirements.in
    #   memray
    #   textual
    #   typer
rpds-py==0.25.1
    # via
    #   jsonschema
    #   referencing
rsa==4.9.1
    # via google-auth
scikit-image==0.25.2
    # via -r requirements.in
scipy==1.15.3
    # via
    #   -r requirements.in
    #   scikit-image
shellingham==1.5.4
    # via typer
six==1.17.0
    # via
    #   opencensus
    #   python-dateutil
smart-open==7.1.0
    # via -r requirements.in
sniffio==1.3.1
    # via anyio
starlette==0.46.2
    # via
    #   -r requirements.in
    #   fastapi
tensorboardx==2.6.2.2
    # via -r requirements.in
textual==3.3.0
    # via memray
tifffile==2025.5.10
    # via scikit-image
typer==0.16.0
    # via -r requirements.in
typing-extensions==4.14.0
    # via
    #   anyio
    #   exceptiongroup
    #   fastapi
    #   gymnasium
    #   multidict
    #   opentelemetry-api
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
    #   opentelemetry-sdk
    #   opentelemetry-semantic-conventions
    #   pydantic
    #   pydantic-core
    #   pyopenssl
    #   referencing
    #   rich
    #   textual
    #   typer
    #   typing-inspection
    #   uvicorn
typing-inspection==0.4.1
    # via pydantic
tzdata==2025.2
    # via pandas
uc-micro-py==1.0.3
    # via linkify-it-py
uritemplate==4.2.0
    # via google-api-python-client
urllib3==2.4.0
    # via requests
uvicorn==0.34.3
    # via -r requirements.in
virtualenv==20.31.2
    # via -r requirements.in
watchfiles==1.0.5
    # via -r requirements.in
wrapt==1.17.2
    # via
    #   dm-tree
    #   smart-open
yarl==1.20.0
    # via aiohttp
zipp==3.23.0
    # via importlib-metadata

Hello! We have an example Docker image you can see here: Image requirements | Anyscale Docs

Can you take a look and compare it against your existing Dockerfile to see if anything is missing? At a glance, yours is missing the part that sets up the ray user (su --login ray), so try copying over the missing parts and see if that helps the problem.
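
The relevant part looks roughly like this (reconstructed from memory of the pattern in the rayproject base images; please verify the exact commands against the linked docs):

# Sketch: create a non-root ray user with passwordless sudo and switch to it.
ARG RAY_UID=1000
ARG RAY_GID=100
RUN useradd -ms /bin/bash -d /home/ray ray --uid $RAY_UID --gid $RAY_GID \
    && usermod -aG sudo ray \
    && echo 'ray ALL=NOPASSWD: ALL' >> /etc/sudoers
USER $RAY_UID
ENV HOME=/home/ray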

Thank you for looking into this.
Following ray/docker/base-deps/Dockerfile at releases/2.45.0 · ray-project/ray · GitHub, it seems there is no need to set up the ray user at this point.

I want to find out what actually happens when I submit ray submit config-docker-no-data.yaml scaling.py --verbose and the log below is printed:
(autoscaler +2m41s) Removing 1 nodes of type ray_worker_med (launch failed).

With that information, I can continue down the troubleshooting path.
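
In the meantime, one thing I can check from the GCP side (a sketch, assuming the gcloud CLI is configured for the cluster's project) is whether the instance-creation operations themselves are failing, e.g. on quota or image errors:

# Recent instance-creation operations; a failed launch should show a
# non-2xx value in the HTTP_STATUS column.
gcloud compute operations list --filter="operationType=insert" --limit=20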

The strange thing is that if the cluster uses rayproject/ray:latest-cpu, the scaling works fine, but if I use the image built from ray/docker/base-deps/Dockerfile at releases/2.45.0 · ray-project/ray · GitHub, the autoscaling fails. I want to dig deeper into why this is happening.
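
A comparison like the following might narrow the gap: run the same checks in both images and diff the output (a sketch; my-custom-image is a placeholder for my tag, and bash -lc mirrors the login shell the launcher uses in the logs above):

# Compare what the cluster launcher will see inside each container.
for img in my-custom-image rayproject/ray:latest-cpu; do
    echo "== $img =="
    docker run --rm "$img" bash -lc 'whoami; echo "$HOME"; which python ray rsync; ray --version'
done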