1. Severity of the issue: (select one)
[x] High: Completely blocks me.
2. Environment:
- Ray version: 2.45.0
- Python version: 3.10
- OS: Ubuntu 22.04
- Cloud/Infrastructure: GCP
- Other libs/tools (if relevant): see requirements.txt (included below) and constraints.txt
3. What happened vs. what you expected:
- Expected: When running a job on a Ray cluster started from the Ray Docker image, the head node should scale the cluster out and distribute the workload.
- Actual: When running a job on a Ray cluster with the Ray Docker image, the tasks on the head node run without issues, but the scale-out fails. Logs below:
$ ray submit /opt/ray/config-docker-no-data.yaml scaling.py --verbose
2025-06-07 19:54:41,730 INFO util.py:382 -- setting max workers for head node type to 0
Loaded cached provider configuration from /tmp/ray-config-b8ca643054c58868c55bb0feb64a63327283e4a0
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2025-06-07 19:54:41,900 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-06-07 19:54:41,901 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
Fetched IP: 10.230.230.145
Running `mkdir -p /tmp/ray_tmp_mount/no-data-scaling/~ && chown -R ubuntu /tmp/ray_tmp_mount/no-data-scaling/~`
Warning: Permanently added '10.230.230.145' (ED25519) to the list of known hosts.
Shared connection to 10.230.230.145 closed.
Running `rsync --rsh ssh -i /home/biswalc/.ssh/ray-autoscaler_gcp_us-central1_rsc-general-computing_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4f0be890e6/e96c00e37e/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore scaling.py ubuntu@10.230.230.145:/tmp/ray_tmp_mount/no-data-scaling/~/scaling.py`
sending incremental file list
scaling.py
sent 938 bytes received 35 bytes 1,946.00 bytes/sec
total size is 1,687 speedup is 1.73
Running `docker inspect -f '{{.State.Running}}' ray_shuffling || true`
Shared connection to 10.230.230.145 closed.
Running `docker exec ray_shuffling printenv HOME`
Shared connection to 10.230.230.145 closed.
Running `docker exec -it ray_shuffling /bin/bash -c 'mkdir -p /root' && rsync -e 'docker exec -i' -avz /tmp/ray_tmp_mount/no-data-scaling/~/scaling.py ray_shuffling:/root/scaling.py`
sending incremental file list
scaling.py
sent 936 bytes received 35 bytes 1,942.00 bytes/sec
total size is 1,687 speedup is 1.74
Shared connection to 10.230.230.145 closed.
`rsync`ed scaling.py (local) to ~/scaling.py (remote)
2025-06-07 19:54:47,507 INFO util.py:382 -- setting max workers for head node type to 0
Fetched IP: 10.230.230.145
Running `docker exec ray_shuffling printenv HOME`
Shared connection to 10.230.230.145 closed.
Running `docker exec -it ray_shuffling /bin/bash -c 'bash --login -c -i '"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /root/scaling.py)'"'"'' `
2025-06-07 12:54:50,408 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.230.230.145:6379...
2025-06-07 12:54:50,422 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.230.230.145:8265
Initial cluster resources: {'CPU': 4.0, 'object_store_memory': 4625979801.0, 'node:__internal_head__': 1.0, 'node:10.239.230.145': 1.0, 'memory': 9251959604.0}
Requesting 100 CPU-intensive remote tasks...
(intense_cpu_task pid=471) Starting CPU task on 10.230.230.145
(autoscaler +1m11s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +1m11s) Adding 2 node(s) of type ray_worker_med.
(autoscaler +3m28s) Removing 1 nodes of type ray_worker_med (launch failed).
(autoscaler +4m49s) Adding 1 node(s) of type ray_worker_med.
(autoscaler +4m49s) Removing 1 nodes of type ray_worker_med (launch failed).
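For reproducibility, scaling.py is essentially the following. This is a minimal sketch reconstructed from the log output above (the task name, the 100-task count, and the printed messages match the log); the exact task body and durations are illustrative, not the original source:

```python
import socket
import time

import ray


@ray.remote(num_cpus=1)
def intense_cpu_task(duration_s: float = 60.0) -> str:
    """Busy-loop for ~duration_s seconds to create CPU demand the autoscaler must satisfy."""
    host = socket.gethostbyname(socket.gethostname())
    print(f"Starting CPU task on {host}")
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        pass  # burn CPU
    return host


if __name__ == "__main__":
    # Connect to the running cluster (the head node started by `ray up`).
    ray.init(address="auto")
    print("Initial cluster resources:", ray.cluster_resources())

    print("Requesting 100 CPU-intensive remote tasks...")
    hosts = ray.get([intense_cpu_task.remote() for _ in range(100)])
    print("Tasks ran on nodes:", sorted(set(hosts)))
```

With only 4 CPUs on the head node, 100 single-CPU tasks should force the autoscaler to request ray_worker_med nodes, which is what the autoscaler events above show.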
How can I find additional details on why the Ray scaling is failing? If I use the default image, rayproject/ray:latest-cpu, the scaling works fine. But when I use my own image built from scratch, it doesn't scale. I want to find out what is missing in my Dockerfile.
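For context, these are the commands I understand should surface the autoscaler's launch errors (assuming the default Ray log layout under /tmp/ray/session_latest/logs on the head node):

```shell
# Stream the autoscaler (monitor) log for this cluster from the head node.
ray monitor /opt/ray/config-docker-no-data.yaml

# Or inspect the monitor logs directly on the head node.
ray exec /opt/ray/config-docker-no-data.yaml \
  'tail -n 200 /tmp/ray/session_latest/logs/monitor.out /tmp/ray/session_latest/logs/monitor.err'

# Summary of pending/failed node launches.
ray exec /opt/ray/config-docker-no-data.yaml 'ray status'
```

So far these only repeat the "launch failed" events shown above without a root cause.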
Below is my Dockerfile:
FROM ubuntu:22.04
ARG PYTHON_VERSION="3.10"
ENV PYTHON_VERSION=${PYTHON_VERSION}
ENV AUTOSCALER="autoscaler"
ENV TZ="America/Los_Angeles"
ENV LC_ALL="C.UTF-8"
ENV LANG="C.UTF-8"
ENV DEBIAN_FRONTEND="noninteractive"
RUN apt-get update && apt-get install -y \
    python3-distutils \
    python3-testresources \
    cmake \
    curl \
    g++ \
    gcc \
    git \
    gnupg \
    libffi-dev \
    libjemalloc-dev \
    netbase \
    openssh-client \
    parallel \
    pkg-config \
    rsync \
    screen \
    sudo \
    tmux \
    tzdata \
    unzip \
    wget \
    zip \
    zlib1g-dev
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 0 && \
update-alternatives --install /usr/bin/python python /usr/bin/python${PYTHON_VERSION} 0
RUN python3 --version && \
which python3 && \
python --version && \
which python
RUN curl -o get-pip.py https://bootstrap.pypa.io/get-pip.py && \
python3 get-pip.py --user
RUN apt-get -y update && \
apt-get clean autoclean && \
apt-get autoremove -y --purge && \
rm -rf /var/lib/apt/lists/*
COPY *.txt .
RUN PATH="${HOME}/.local/bin:$PATH" \
python3 -m pip install \
-r requirements.txt
requirements.txt (generated by pip-compile from requirements.in):
absl-py==2.3.0
# via dm-tree
aiohappyeyeballs==2.6.1
# via aiohttp
aiohttp==3.12.11
# via
# -r requirements.in
# aiohttp-cors
aiohttp-cors==0.8.1
# via -r requirements.in
aiorwlock==1.5.0
# via -r requirements.in
aiosignal==1.3.2
# via aiohttp
annotated-types==0.7.0
# via pydantic
anyio==4.9.0
# via
# starlette
# watchfiles
async-timeout==5.0.1
# via aiohttp
attrs==25.3.0
# via
# aiohttp
# dm-tree
# jsonschema
# referencing
cachetools==5.5.2
# via google-auth
certifi==2025.4.26
# via requests
cffi==1.17.1
# via cryptography
charset-normalizer==3.4.2
# via requests
click==8.2.1
# via
# -r requirements.in
# ray
# typer
# uvicorn
cloudpickle==3.1.1
# via gymnasium
colorful==0.5.6
# via -r requirements.in
cryptography==45.0.3
# via pyopenssl
cupy-cuda12x==13.4.1
# via -r requirements.in
distlib==0.3.9
# via virtualenv
dm-tree==0.1.9
# via -r requirements.in
exceptiongroup==1.3.0
# via anyio
farama-notifications==0.0.4
# via gymnasium
fastapi==0.115.12
# via -r requirements.in
fastrlock==0.8.3
# via cupy-cuda12x
filelock==3.18.0
# via
# -r requirements.in
# ray
# virtualenv
frozenlist==1.6.2
# via
# aiohttp
# aiosignal
fsspec==2025.5.1
# via -r requirements.in
google-api-core==2.25.0
# via
# google-api-python-client
# opencensus
google-api-python-client==2.171.0
# via -r requirements.in
google-auth==2.40.3
# via
# google-api-core
# google-api-python-client
# google-auth-httplib2
google-auth-httplib2==0.2.0
# via google-api-python-client
googleapis-common-protos==1.70.0
# via
# google-api-core
# opentelemetry-exporter-otlp-proto-grpc
# opentelemetry-exporter-otlp-proto-http
grpcio==1.73.0
# via
# -r requirements.in
# opentelemetry-exporter-otlp-proto-grpc
gymnasium==1.0.0
# via -r requirements.in
h11==0.16.0
# via uvicorn
httplib2==0.22.0
# via
# google-api-python-client
# google-auth-httplib2
idna==3.10
# via
# anyio
# requests
# yarl
imageio==2.37.0
# via scikit-image
importlib-metadata==8.7.0
# via opentelemetry-api
jinja2==3.1.6
# via memray
jsonschema==4.24.0
# via
# -r requirements.in
# ray
jsonschema-specifications==2025.4.1
# via jsonschema
lazy-loader==0.4
# via scikit-image
linkify-it-py==2.0.3
# via markdown-it-py
lz4==4.4.4
# via -r requirements.in
markdown-it-py[linkify,plugins]==3.0.0
# via
# mdit-py-plugins
# rich
# textual
markupsafe==3.0.2
# via jinja2
mdit-py-plugins==0.4.2
# via markdown-it-py
mdurl==0.1.2
# via markdown-it-py
memray==1.17.2
# via -r requirements.in
msgpack==1.1.0
# via
# -r requirements.in
# ray
multidict==6.4.4
# via
# aiohttp
# yarl
networkx==3.4.2
# via scikit-image
numpy==2.2.6
# via
# -r requirements.in
# cupy-cuda12x
# dm-tree
# gymnasium
# imageio
# pandas
# scikit-image
# scipy
# tensorboardx
# tifffile
opencensus==0.11.4
# via -r requirements.in
opencensus-context==0.1.3
# via opencensus
opentelemetry-api==1.34.0
# via
# -r requirements.in
# opentelemetry-exporter-otlp-proto-grpc
# opentelemetry-exporter-otlp-proto-http
# opentelemetry-sdk
# opentelemetry-semantic-conventions
opentelemetry-exporter-otlp==1.34.0
# via -r requirements.in
opentelemetry-exporter-otlp-proto-common==1.34.0
# via
# opentelemetry-exporter-otlp-proto-grpc
# opentelemetry-exporter-otlp-proto-http
opentelemetry-exporter-otlp-proto-grpc==1.34.0
# via opentelemetry-exporter-otlp
opentelemetry-exporter-otlp-proto-http==1.34.0
# via opentelemetry-exporter-otlp
opentelemetry-proto==1.34.0
# via
# opentelemetry-exporter-otlp-proto-common
# opentelemetry-exporter-otlp-proto-grpc
# opentelemetry-exporter-otlp-proto-http
opentelemetry-sdk==1.34.0
# via
# -r requirements.in
# opentelemetry-exporter-otlp-proto-grpc
# opentelemetry-exporter-otlp-proto-http
opentelemetry-semantic-conventions==0.55b0
# via opentelemetry-sdk
packaging==25.0
# via
# -r requirements.in
# lazy-loader
# ray
# scikit-image
# tensorboardx
pandas==2.3.0
# via -r requirements.in
pillow==11.2.1
# via
# imageio
# scikit-image
platformdirs==4.3.8
# via
# textual
# virtualenv
prometheus-client==0.22.1
# via -r requirements.in
propcache==0.3.1
# via
# aiohttp
# yarl
proto-plus==1.26.1
# via google-api-core
protobuf==5.29.5
# via
# -r requirements.in
# google-api-core
# googleapis-common-protos
# opentelemetry-proto
# proto-plus
# ray
# tensorboardx
py-spy==0.4.0
# via -r requirements.in
pyarrow==20.0.0
# via -r requirements.in
pyasn1==0.6.1
# via
# pyasn1-modules
# rsa
pyasn1-modules==0.4.2
# via google-auth
pycparser==2.22
# via cffi
pydantic==2.11.5
# via
# -r requirements.in
# fastapi
pydantic-core==2.33.2
# via pydantic
pygments==2.19.1
# via rich
pyopenssl==25.1.0
# via -r requirements.in
pyparsing==3.2.3
# via httplib2
python-dateutil==2.9.0.post0
# via pandas
pytz==2025.2
# via pandas
pyyaml==6.0.2
# via
# -r requirements.in
# ray
ray==2.45.0
# via -r requirements.in
referencing==0.36.2
# via
# jsonschema
# jsonschema-specifications
requests==2.32.4
# via
# -r requirements.in
# google-api-core
# opentelemetry-exporter-otlp-proto-http
# ray
rich==14.0.0
# via
# -r requirements.in
# memray
# textual
# typer
rpds-py==0.25.1
# via
# jsonschema
# referencing
rsa==4.9.1
# via google-auth
scikit-image==0.25.2
# via -r requirements.in
scipy==1.15.3
# via
# -r requirements.in
# scikit-image
shellingham==1.5.4
# via typer
six==1.17.0
# via
# opencensus
# python-dateutil
smart-open==7.1.0
# via -r requirements.in
sniffio==1.3.1
# via anyio
starlette==0.46.2
# via
# -r requirements.in
# fastapi
tensorboardx==2.6.2.2
# via -r requirements.in
textual==3.3.0
# via memray
tifffile==2025.5.10
# via scikit-image
typer==0.16.0
# via -r requirements.in
typing-extensions==4.14.0
# via
# anyio
# exceptiongroup
# fastapi
# gymnasium
# multidict
# opentelemetry-api
# opentelemetry-exporter-otlp-proto-grpc
# opentelemetry-exporter-otlp-proto-http
# opentelemetry-sdk
# opentelemetry-semantic-conventions
# pydantic
# pydantic-core
# pyopenssl
# referencing
# rich
# textual
# typer
# typing-inspection
# uvicorn
typing-inspection==0.4.1
# via pydantic
tzdata==2025.2
# via pandas
uc-micro-py==1.0.3
# via linkify-it-py
uritemplate==4.2.0
# via google-api-python-client
urllib3==2.4.0
# via requests
uvicorn==0.34.3
# via -r requirements.in
virtualenv==20.31.2
# via -r requirements.in
watchfiles==1.0.5
# via -r requirements.in
wrapt==1.17.2
# via
# dm-tree
# smart-open
yarl==1.20.0
# via aiohttp
zipp==3.23.0
# via importlib-metadata