Build Custom Ray Docker image

1. Severity of the issue: (select one)
[x] High: Completely blocks me.

2. Environment:

  • Ray version: 2.45.0
  • Python version: 3.10
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure: GCP
  • Other libs/tools (if relevant): see the requirements.txt and constraints.txt below

3. What happened vs. what you expected:

  • Expected: When running a job on a Ray cluster started from the custom Ray Docker image, the head node should be able to scale out and distribute the workload.
  • Actual: When running a job on a Ray cluster with the custom Ray Docker image, the tasks on the head node run without issues, but the scale-out fails. Logs below:
$ ray submit /opt/ray/config-docker-no-data.yaml scaling.py --verbose
2025-06-07 19:54:41,730	INFO util.py:382 -- setting max workers for head node type to 0
Loaded cached provider configuration from /tmp/ray-config-b8ca643054c58868c55bb0feb64a63327283e4a0
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2025-06-07 19:54:41,900 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-06-07 19:54:41,901 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
Fetched IP: 10.230.230.145
Running `mkdir -p /tmp/ray_tmp_mount/no-data-scaling/~ && chown -R ubuntu /tmp/ray_tmp_mount/no-data-scaling/~`
Warning: Permanently added '10.230.230.145' (ED25519) to the list of known hosts.
Shared connection to 10.230.230.145 closed.
Running `rsync --rsh ssh -i /home/biswalc/.ssh/ray-autoscaler_gcp_us-central1_rsc-general-computing_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4f0be890e6/e96c00e37e/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore scaling.py ubuntu@10.230.230.145:/tmp/ray_tmp_mount/no-data-scaling/~/scaling.py`
sending incremental file list
scaling.py

sent 938 bytes  received 35 bytes  1,946.00 bytes/sec
total size is 1,687  speedup is 1.73
Running `docker inspect -f '{{.State.Running}}' ray_shuffling || true`
Shared connection to 10.230.230.145 closed.
Running `docker exec ray_shuffling printenv HOME`
Shared connection to 10.230.230.145 closed.
Running `docker exec -it  ray_shuffling /bin/bash -c 'mkdir -p /root'  && rsync -e 'docker exec -i' -avz /tmp/ray_tmp_mount/no-data-scaling/~/scaling.py ray_shuffling:/root/scaling.py`
sending incremental file list
scaling.py

sent 936 bytes  received 35 bytes  1,942.00 bytes/sec
total size is 1,687  speedup is 1.74
Shared connection to 10.230.230.145 closed.
`rsync`ed scaling.py (local) to ~/scaling.py (remote)
2025-06-07 19:54:47,507	INFO util.py:382 -- setting max workers for head node type to 0
Fetched IP: 10.230.230.145
Running `docker exec ray_shuffling printenv HOME`
Shared connection to 10.230.230.145 closed.
Running `docker exec -it  ray_shuffling /bin/bash -c 'bash --login -c -i '"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /root/scaling.py)'"'"'' `
2025-06-07 12:54:50,408	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.230.230.145:6379...
2025-06-07 12:54:50,422	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.230.230.145:8265
Initial cluster resources: {'CPU': 4.0, 'object_store_memory': 4625979801.0, 'node:__internal_head__': 1.0, 'node:10.239.230.145': 1.0, 'memory': 9251959604.0}
Requesting 100 CPU-intensive remote tasks...
(intense_cpu_task pid=471) Starting CPU task on 10.230.230.145
(autoscaler +1m11s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +1m11s) Adding 2 node(s) of type ray_worker_med.
(autoscaler +3m28s) Removing 1 nodes of type ray_worker_med (launch failed).
(autoscaler +4m49s) Adding 1 node(s) of type ray_worker_med.
(autoscaler +4m49s) Removing 1 nodes of type ray_worker_med (launch failed).

How can I find additional details on why the Ray scaling is failing? If I use the default image, rayproject/ray:latest-cpu, the scaling works fine.

But when I use my own image built from scratch, it doesn't scale. I want to find out what is missing in my Dockerfile.
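
From the docs, I believe the place to look is the autoscaler's monitor log on the head node; this is roughly what I plan to check (a sketch, assuming the default log location /tmp/ray/session_latest/logs inside the head container), but please correct me if there is a better source:

# Stream the autoscaler log through the cluster launcher:
ray monitor /opt/ray/config-docker-no-data.yaml

# Or pull the monitor logs directly from the head node:
ray exec /opt/ray/config-docker-no-data.yaml \
    "tail -n 200 /tmp/ray/session_latest/logs/monitor.*"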

Below is my Dockerfile:

FROM ubuntu:22.04

ARG PYTHON_VERSION="3.10"

ENV PYTHON_VERSION=${PYTHON_VERSION}
ENV AUTOSCALER="autoscaler"
ENV TZ="America/Los_Angeles"
ENV LC_ALL="C.UTF-8"
ENV LANG="C.UTF-8"
ENV DEBIAN_FRONTEND="noninteractive"

# System packages needed to build and run Ray and its Python dependencies.
RUN apt-get update && apt-get install -y \
        python3-distutils \
        python3-testresources \
        cmake \
        curl \
        g++ \
        gcc \
        git \
        gnupg \
        libffi-dev \
        libjemalloc-dev \
        netbase \
        openssh-client \
        parallel \
        pkg-config \
        rsync \
        screen \
        sudo \
        tmux \
        tzdata \
        unzip \
        wget \
        zip \
        zlib1g-dev

# Point the python3/python commands at the requested interpreter version.
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 0 && \
    update-alternatives --install /usr/bin/python python /usr/bin/python${PYTHON_VERSION} 0

# Sanity-check that the interpreter aliases resolve.
RUN python3 --version && \
    which python3 && \
    python --version && \
    which python

# Bootstrap pip; --user installs it under /root/.local.
RUN curl -o get-pip.py https://bootstrap.pypa.io/get-pip.py && \
    python3 get-pip.py --user

# Refresh indexes, then clean apt caches and lists to keep the image small.
RUN apt-get -y update && \
    apt-get clean autoclean && \
    apt-get autoremove -y --purge && \
    rm -rf /var/lib/apt/lists/*

# Copy requirements.txt and constraints.txt into the image root; only
# requirements.txt is applied below.
COPY *.txt .

RUN PATH="${HOME}/.local/bin:$PATH" \
        python3 -m pip install \
        -r requirements.txt
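
Note: constraints.txt is copied into the image but not applied anywhere yet; if I wanted the pins enforced, the install step would presumably read as below (a sketch of the intent, unrelated to the scaling failure as far as I can tell):

RUN PATH="${HOME}/.local/bin:$PATH" \
        python3 -m pip install \
        -c constraints.txt \
        -r requirements.txt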

requirements.txt (pip-compile output from requirements.in):

absl-py==2.3.0
    # via dm-tree
aiohappyeyeballs==2.6.1
    # via aiohttp
aiohttp==3.12.11
    # via
    #   -r requirements.in
    #   aiohttp-cors
aiohttp-cors==0.8.1
    # via -r requirements.in
aiorwlock==1.5.0
    # via -r requirements.in
aiosignal==1.3.2
    # via aiohttp
annotated-types==0.7.0
    # via pydantic
anyio==4.9.0
    # via
    #   starlette
    #   watchfiles
async-timeout==5.0.1
    # via aiohttp
attrs==25.3.0
    # via
    #   aiohttp
    #   dm-tree
    #   jsonschema
    #   referencing
cachetools==5.5.2
    # via google-auth
certifi==2025.4.26
    # via requests
cffi==1.17.1
    # via cryptography
charset-normalizer==3.4.2
    # via requests
click==8.2.1
    # via
    #   -r requirements.in
    #   ray
    #   typer
    #   uvicorn
cloudpickle==3.1.1
    # via gymnasium
colorful==0.5.6
    # via -r requirements.in
cryptography==45.0.3
    # via pyopenssl
cupy-cuda12x==13.4.1
    # via -r requirements.in
distlib==0.3.9
    # via virtualenv
dm-tree==0.1.9
    # via -r requirements.in
exceptiongroup==1.3.0
    # via anyio
farama-notifications==0.0.4
    # via gymnasium
fastapi==0.115.12
    # via -r requirements.in
fastrlock==0.8.3
    # via cupy-cuda12x
filelock==3.18.0
    # via
    #   -r requirements.in
    #   ray
    #   virtualenv
frozenlist==1.6.2
    # via
    #   aiohttp
    #   aiosignal
fsspec==2025.5.1
    # via -r requirements.in
google-api-core==2.25.0
    # via
    #   google-api-python-client
    #   opencensus
google-api-python-client==2.171.0
    # via -r requirements.in
google-auth==2.40.3
    # via
    #   google-api-core
    #   google-api-python-client
    #   google-auth-httplib2
google-auth-httplib2==0.2.0
    # via google-api-python-client
googleapis-common-protos==1.70.0
    # via
    #   google-api-core
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
grpcio==1.73.0
    # via
    #   -r requirements.in
    #   opentelemetry-exporter-otlp-proto-grpc
gymnasium==1.0.0
    # via -r requirements.in
h11==0.16.0
    # via uvicorn
httplib2==0.22.0
    # via
    #   google-api-python-client
    #   google-auth-httplib2
idna==3.10
    # via
    #   anyio
    #   requests
    #   yarl
imageio==2.37.0
    # via scikit-image
importlib-metadata==8.7.0
    # via opentelemetry-api
jinja2==3.1.6
    # via memray
jsonschema==4.24.0
    # via
    #   -r requirements.in
    #   ray
jsonschema-specifications==2025.4.1
    # via jsonschema
lazy-loader==0.4
    # via scikit-image
linkify-it-py==2.0.3
    # via markdown-it-py
lz4==4.4.4
    # via -r requirements.in
markdown-it-py[linkify,plugins]==3.0.0
    # via
    #   mdit-py-plugins
    #   rich
    #   textual
markupsafe==3.0.2
    # via jinja2
mdit-py-plugins==0.4.2
    # via markdown-it-py
mdurl==0.1.2
    # via markdown-it-py
memray==1.17.2
    # via -r requirements.in
msgpack==1.1.0
    # via
    #   -r requirements.in
    #   ray
multidict==6.4.4
    # via
    #   aiohttp
    #   yarl
networkx==3.4.2
    # via scikit-image
numpy==2.2.6
    # via
    #   -r requirements.in
    #   cupy-cuda12x
    #   dm-tree
    #   gymnasium
    #   imageio
    #   pandas
    #   scikit-image
    #   scipy
    #   tensorboardx
    #   tifffile
opencensus==0.11.4
    # via -r requirements.in
opencensus-context==0.1.3
    # via opencensus
opentelemetry-api==1.34.0
    # via
    #   -r requirements.in
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
    #   opentelemetry-sdk
    #   opentelemetry-semantic-conventions
opentelemetry-exporter-otlp==1.34.0
    # via -r requirements.in
opentelemetry-exporter-otlp-proto-common==1.34.0
    # via
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
opentelemetry-exporter-otlp-proto-grpc==1.34.0
    # via opentelemetry-exporter-otlp
opentelemetry-exporter-otlp-proto-http==1.34.0
    # via opentelemetry-exporter-otlp
opentelemetry-proto==1.34.0
    # via
    #   opentelemetry-exporter-otlp-proto-common
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
opentelemetry-sdk==1.34.0
    # via
    #   -r requirements.in
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
opentelemetry-semantic-conventions==0.55b0
    # via opentelemetry-sdk
packaging==25.0
    # via
    #   -r requirements.in
    #   lazy-loader
    #   ray
    #   scikit-image
    #   tensorboardx
pandas==2.3.0
    # via -r requirements.in
pillow==11.2.1
    # via
    #   imageio
    #   scikit-image
platformdirs==4.3.8
    # via
    #   textual
    #   virtualenv
prometheus-client==0.22.1
    # via -r requirements.in
propcache==0.3.1
    # via
    #   aiohttp
    #   yarl
proto-plus==1.26.1
    # via google-api-core
protobuf==5.29.5
    # via
    #   -r requirements.in
    #   google-api-core
    #   googleapis-common-protos
    #   opentelemetry-proto
    #   proto-plus
    #   ray
    #   tensorboardx
py-spy==0.4.0
    # via -r requirements.in
pyarrow==20.0.0
    # via -r requirements.in
pyasn1==0.6.1
    # via
    #   pyasn1-modules
    #   rsa
pyasn1-modules==0.4.2
    # via google-auth
pycparser==2.22
    # via cffi
pydantic==2.11.5
    # via
    #   -r requirements.in
    #   fastapi
pydantic-core==2.33.2
    # via pydantic
pygments==2.19.1
    # via rich
pyopenssl==25.1.0
    # via -r requirements.in
pyparsing==3.2.3
    # via httplib2
python-dateutil==2.9.0.post0
    # via pandas
pytz==2025.2
    # via pandas
pyyaml==6.0.2
    # via
    #   -r requirements.in
    #   ray
ray==2.45.0
    # via -r requirements.in
referencing==0.36.2
    # via
    #   jsonschema
    #   jsonschema-specifications
requests==2.32.4
    # via
    #   -r requirements.in
    #   google-api-core
    #   opentelemetry-exporter-otlp-proto-http
    #   ray
rich==14.0.0
    # via
    #   -r requirements.in
    #   memray
    #   textual
    #   typer
rpds-py==0.25.1
    # via
    #   jsonschema
    #   referencing
rsa==4.9.1
    # via google-auth
scikit-image==0.25.2
    # via -r requirements.in
scipy==1.15.3
    # via
    #   -r requirements.in
    #   scikit-image
shellingham==1.5.4
    # via typer
six==1.17.0
    # via
    #   opencensus
    #   python-dateutil
smart-open==7.1.0
    # via -r requirements.in
sniffio==1.3.1
    # via anyio
starlette==0.46.2
    # via
    #   -r requirements.in
    #   fastapi
tensorboardx==2.6.2.2
    # via -r requirements.in
textual==3.3.0
    # via memray
tifffile==2025.5.10
    # via scikit-image
typer==0.16.0
    # via -r requirements.in
typing-extensions==4.14.0
    # via
    #   anyio
    #   exceptiongroup
    #   fastapi
    #   gymnasium
    #   multidict
    #   opentelemetry-api
    #   opentelemetry-exporter-otlp-proto-grpc
    #   opentelemetry-exporter-otlp-proto-http
    #   opentelemetry-sdk
    #   opentelemetry-semantic-conventions
    #   pydantic
    #   pydantic-core
    #   pyopenssl
    #   referencing
    #   rich
    #   textual
    #   typer
    #   typing-inspection
    #   uvicorn
typing-inspection==0.4.1
    # via pydantic
tzdata==2025.2
    # via pandas
uc-micro-py==1.0.3
    # via linkify-it-py
uritemplate==4.2.0
    # via google-api-python-client
urllib3==2.4.0
    # via requests
uvicorn==0.34.3
    # via -r requirements.in
virtualenv==20.31.2
    # via -r requirements.in
watchfiles==1.0.5
    # via -r requirements.in
wrapt==1.17.2
    # via
    #   dm-tree
    #   smart-open
yarl==1.20.0
    # via aiohttp
zipp==3.23.0
    # via importlib-metadata

Hello! We have an example Docker image you can see here: Image requirements | Anyscale Docs

Can you take a look and compare it against your existing Dockerfile to see if anything is missing? At a glance, yours is missing the part that sets up the ray user (su --login ray), so try copying over the missing parts and see if that helps the problem.
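
The relevant part looks roughly like this (reconstructed from memory of the pattern in the rayproject base images; please verify the exact commands against the linked docs):

# Sketch: create a non-root ray user with passwordless sudo and switch to it.
ARG RAY_UID=1000
ARG RAY_GID=100
RUN useradd -ms /bin/bash -d /home/ray ray --uid $RAY_UID --gid $RAY_GID \
    && usermod -aG sudo ray \
    && echo 'ray ALL=NOPASSWD: ALL' >> /etc/sudoers
USER $RAY_UID
ENV HOME=/home/ray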

Thank you for looking into this.
Following ray/docker/base-deps/Dockerfile at releases/2.45.0 · ray-project/ray · GitHub, it seems there is no need to set up the ray user at this point.

I want to find out what actually happens when I submit ray submit config-docker-no-data.yaml scaling.py --verbose and the log below is printed:
(autoscaler +2m41s) Removing 1 nodes of type ray_worker_med (launch failed).

With that information, I can continue down the troubleshooting path.
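
In the meantime, one thing I can check from the GCP side (a sketch, assuming the gcloud CLI is configured for the cluster's project) is whether the instance-creation operations themselves are failing, e.g. on quota or image errors:

# Recent instance-creation operations; a failed launch should show a
# non-2xx value in the HTTP_STATUS column.
gcloud compute operations list --filter="operationType=insert" --limit=20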

The strange thing is that if the cluster uses rayproject/ray:latest-cpu, the scaling works fine, but if I use the image built from ray/docker/base-deps/Dockerfile at releases/2.45.0 · ray-project/ray · GitHub, the autoscaling fails. I want to dig deeper into why this is happening.
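
A comparison like the following might narrow the gap: run the same checks in both images and diff the output (a sketch; my-custom-image is a placeholder for my tag, and bash -lc mirrors the login shell the launcher uses in the logs above):

# Compare what the cluster launcher will see inside each container.
for img in my-custom-image rayproject/ray:latest-cpu; do
    echo "== $img =="
    docker run --rm "$img" bash -lc 'whoami; echo "$HOME"; which python ray rsync; ray --version'
done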