Ray doesn't work in a docker container (linux)

It works locally on my Mac, but once I try to run it inside a local Docker container I get the following:

A warning:
WARNING services.py:1922 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.39gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

After the warning it says: INFO worker.py:1528 -- Started a local Ray instance.

A few seconds later I get this error:
core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

I tried increasing the /dev/shm size as explained, and it didn’t help.
I also tried limiting the number of CPUs in the init() call (as mentioned here).

Any ideas on what I can do to solve it?
Thanks for your help.
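
One way to confirm that a larger --shm-size actually took effect inside the container is to compare the size of /dev/shm with total RAM, following the 30% guideline from the warning. This is only a minimal sketch: the helper name shm_report is made up for illustration, and os.sysconf may report the host's RAM rather than a container memory limit.

```python
import os
import shutil


def shm_report(path="/dev/shm", min_fraction=0.3):
    """Compare the size of the filesystem at `path` with a fraction of RAM.

    `shm_report` is a hypothetical helper, not part of Ray.
    """
    # Total size of the filesystem that backs `path`.
    shm_total = shutil.disk_usage(path).total
    # Physical memory as seen by the kernel (assumption: inside a container
    # this may be the host's RAM, not the container's memory limit).
    ram_total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return {
        "shm_bytes": shm_total,
        "ram_bytes": ram_total,
        "large_enough": shm_total > min_fraction * ram_total,
    }
```

Running print(shm_report()) in the container before calling ray.init() shows whether the 67108864-byte default is still in place.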

Hi @hagai-arad !

Could you share a Dockerfile so we can reproduce locally? It’s hard to tell what’s going on based on what you’ve provided here.

Generally, I recommend using the docker images we build for Ray instead of building your own, if possible. We run a lot of tests against them to make sure they’re ready for workloads.

Sure @cade!

My Dockerfile:

FROM --platform=linux/amd64 python:3.10.9-slim-bullseye

ENV PYTHONPATH='/app-dir'

RUN pip install --no-cache-dir -r requirements.txt

requirements.txt:

ray==2.1.0
tqdm==4.64.1
boto3==1.26.22

I currently prefer to use my image. Thanks for the tip.

I built the image and ran Ray using

ray stop; ray start --head --port=6379 --object-manager-port=8076 --port=9031 --no-monitor

I see that the raylet crashes; the logs (/tmp/ray/session_latest/logs/raylet.out) contain this line:

[2022-12-27 04:21:01,838 E 97 168] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

This indicates that the dashboard agent is the root cause of the failure. dashboard_agent.log doesn’t have anything interesting, but dashboard.log does:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/node/node_head.py", line 317, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/usr/local/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.2:44953: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.2:44953: Failed to connect to remote host: Connection refused {created_time:"2022-12-27T04:45:13.329170763+00:00", grpc_status:14}"

I’m not sure if this is the root cause or a red herring…

cc @sangcho any idea why this docker image isn’t working?

I also tried pip install 'ray[default]==2.1.0'.

Looks like 44953 is the node manager port:

NodeManager server started, listening on port 44953.
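
Since dashboard.log reports "Connection refused" for 172.17.0.2:44953, a quick probe from inside the container can show whether anything is still listening on that port. A minimal sketch using only the standard library; port_is_open is a made-up helper name, and the address and port are the ones from the logs above (they will differ between runs).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for debugging; not part of Ray.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing accepts connections on the target port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_is_open("172.17.0.2", 44953) returning False right after the raylet error would be consistent with the raylet having already exited.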

Ah, this is probably the grpc version issue. I’ve seen that it doesn’t work well when the version is >= 1.5. Can you double-check that the installed grpc version matches the constraint in ray/setup.py at 0c8b59d2d90df0cfe0f17d9feb7ef9b3e5fe53f2 · ray-project/ray · GitHub?
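
To double-check the installed version from inside the image, something like the following might help. It is a rough sketch: the helper names are made up, the comparison ignores pre-release suffixes, and the upper bound has to be taken from the setup.py linked above (the one used in the note below is only a placeholder). The PyPI distribution that provides the grpc module is grpcio.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.51.1' into (1, 51, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(dist="grpcio"):
    """Return the installed version string of `dist`, or None if missing."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def is_below(version_text, upper_bound):
    """True if version_text is strictly lower than upper_bound."""
    # Tuples compare numerically, so (1, 5) < (1, 50), unlike plain strings.
    return parse_version(version_text) < parse_version(upper_bound)
```

With a placeholder bound, is_below(installed_version(), "1.50.0") tells you whether the installed grpcio is older than that bound; installed_version() returns None when grpcio is not installed at all, so check for that first.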

Thanks for reporting the issue, @hagai-arad. Have you tried @sangcho's suggestion?

Yes, it didn’t work, unfortunately.
Sorry for not answering. @sangcho also wrote the same to me on Stack Overflow: python - Ray doesn't work in a docker container (linux) - Stack Overflow

We now have arm64 wheels available (from the master). Installing Ray — Ray 3.0.0.dev0

Maybe you can try that?

Hi, are there any updates on this?
I’ve also tried running the same code on two different machines with similar hardware.
On one machine I don’t use Docker, and it runs smoothly; on the other I use Docker, and it fails with an OOM error after some time.

I ran into the same problem, and in my case it is even worse: I can’t install Ray with pip at all. It fails with the following errors:

ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

I then tried installing the Ray 3.0 wheel from a local file, but that was not successful either. It printed this message:

ERROR: ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl is not a supported wheel on this platform.

I’m on Windows 10 and run Docker under WSL. The Docker image is alpine3.19, and the Linux kernel is: Linux py3-travel 5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 Linux

What should I try next? I hope you can give me some suggestions.

Thanks.
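
A possible explanation for the "not a supported wheel" message: manylinux wheels are built against glibc, while Alpine images use musl, so pip on Alpine rejects them even when the Python version and CPU architecture match. The sketch below compares the tags in the wheel filename with a few facts about the running interpreter. The helper names are made up, and treating an empty libc name as "probably not glibc" is an assumption about how platform.libc_ver() behaves on musl.

```python
import platform
import sys


def wheel_tags(filename):
    """Split a wheel filename into its Python, ABI and platform tags."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # The last three dash-separated fields are the compatibility tags.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}


def local_facts():
    """Collect interpreter facts to compare against the wheel tags."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "python": f"cp{sys.version_info.major}{sys.version_info.minor}",
        "machine": platform.machine(),
        # Assumption: an empty name hints at a non-glibc libc such as musl.
        "libc": libc_name or "unknown (possibly musl)",
        "libc_version": libc_version,
    }
```

Printing wheel_tags("ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl") next to local_facts() inside the Alpine container should show where the mismatch is. If the libc turns out not to be glibc, switching to a glibc-based image such as the python:3.10.9-slim-bullseye one used earlier in this thread is probably the simplest thing to try.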