Ray doesn't work in a docker container (linux)

hagai-arad · December 21, 2022, 8:43am

Ray works locally on my Mac, but when I run it inside a local Docker container I get the following:

A warning:
WARNING services.py:1922 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.39gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

after the warning it says: INFO worker.py:1528 -- Started a local Ray instance.

and a few seconds later I get this error:
core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

I tried increasing the size of /dev/shm as the warning suggests, and it didn’t help.
I also tried limiting the number of CPUs in the init() call (as mentioned here).

Any ideas what I can do to solve this?
Thanks for your help.
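
For anyone checking the /dev/shm warning quoted above, a minimal Python sketch of the 30% rule it mentions follows. It assumes a Linux container, and the helper names (`shm_is_large_enough`, `read_sizes`) are illustrative, not part of Ray. It also compares the total size of /dev/shm, whereas the warning reports available bytes.

```python
import os
import shutil


def shm_is_large_enough(shm_bytes, ram_bytes, fraction=0.3):
    """Return True if shared memory exceeds `fraction` of RAM (the warning's 30% rule)."""
    return shm_bytes > fraction * ram_bytes


def read_sizes(shm_path="/dev/shm"):
    """Return (shared-memory size, physical RAM) in bytes. Linux only."""
    # Total size of the tmpfs mounted at shm_path.
    shm_bytes = shutil.disk_usage(shm_path).total
    # Physical RAM as page size times number of physical pages.
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return shm_bytes, ram_bytes
```

Inside the container, `shm_is_large_enough(*read_sizes())` returning False would suggest that the `--shm-size` value passed to `docker run` is still too small.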

cade · December 22, 2022, 9:22pm

Hi @hagai-arad !

Could you share a Dockerfile so we can reproduce locally? It’s hard to tell what’s going on based on what you’ve provided here.

Generally, I recommend using the docker images we build for Ray instead of building your own, if possible. We run a lot of tests against them to make sure they’re ready for workloads.

hagai-arad · December 25, 2022, 8:11am

Sure @cade!

My Dockerfile:

FROM --platform=linux/amd64 python:3.10.9-slim-bullseye

ENV PYTHONPATH='/app-dir'

RUN pip install --no-cache-dir -r requirements.txt

requirements.txt:

ray==2.1.0
tqdm==4.64.1
boto3==1.26.22

I currently prefer to use my image. Thanks for the tip.

cade · December 27, 2022, 4:49am

I built the image and ran Ray using

ray stop; ray start --head --port=6379 --object-manager-port=8076 --port=9031 --no-monitor

I see that the raylet crashes. The log (/tmp/ray/session_latest/logs/raylet.out) has this line:

[2022-12-27 04:21:01,838 E 97 168] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

This indicates that the dashboard agent is the root cause of the failure. dashboard_agent.log doesn’t have anything interesting, but dashboard.log does:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/node/node_head.py", line 317, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/usr/local/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.2:44953: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.2:44953: Failed to connect to remote host: Connection refused {created_time:"2022-12-27T04:45:13.329170763+00:00", grpc_status:14}"

I’m not sure if this is the root cause or a red herring…
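
The manual search through the session logs described above can be sketched in Python as follows. This is only an illustration: the function name and the default keyword list are assumptions chosen to match the messages quoted in this thread, not anything Ray provides.

```python
from pathlib import Path


def find_matching_lines(log_dir, keywords=("Traceback", "Failed", "Connection refused")):
    """Yield (file name, line number, line) for log lines containing any keyword."""
    for path in sorted(Path(log_dir).glob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(keyword in line for keyword in keywords):
                    yield path.name, number, line.rstrip("\n")
```

For example, `for hit in find_matching_lines("/tmp/ray/session_latest/logs"): print(*hit)` would list candidate lines from raylet.out, dashboard.log, and the other files in that directory. The match is case-sensitive.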

cade · December 27, 2022, 4:51am

cc @sangcho any idea why this docker image isn’t working?

I also tried pip install 'ray[default]==2.1.0'.

cade · December 27, 2022, 5:15am

Looks like 44953 is the node manager port

NodeManager server started, listening on port 44953.

sangcho · January 8, 2023, 4:32pm

Ah, this is probably a grpc version issue. I’ve seen Ray not work well when the grpc version is >= 1.5. Can you double-check that your grpc version follows ray/setup.py at 0c8b59d2d90df0cfe0f17d9feb7ef9b3e5fe53f2 · ray-project/ray · GitHub?
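
One way to run the version check suggested above is sketched below in Python. It assumes the relevant distribution is named `grpcio` on PyPI. The helper names are hypothetical, and the lower and upper bounds are deliberately left as arguments to be copied from the linked setup.py rather than guessed here.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.49.1' into (1, 49, 1), ignoring a non-numeric suffix such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(package="grpcio"):
    """Return the installed version as a tuple, or None if the package is missing."""
    try:
        return parse_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def within_bounds(version, lower, upper):
    """True if lower <= version <= upper, comparing tuples component by component."""
    return lower <= version <= upper
```

Inside the container, `installed_version()` shows what pip actually resolved, and `within_bounds(installed_version(), lower, upper)` compares it against whatever range setup.py pins.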

zhz · January 18, 2023, 4:55pm

Thanks for reporting the issue, @hagai-arad. Have you tried @sangcho’s suggestion?

hagai-arad · January 19, 2023, 10:10am

Yes, unfortunately it didn’t work.
Sorry for not answering sooner. @sangcho also suggested the same thing on Stack Overflow: python - Ray doesn't work in a docker container (linux) - Stack Overflow

sangcho · January 19, 2023, 11:28pm

We now have arm64 wheels available (from the master). Installing Ray — Ray 3.0.0.dev0

Maybe you can try that?

ColdFrenzy · May 16, 2024, 1:50pm

Hi, are there any updates on this?
I’ve also tried running the same code on two different machines with similar hardware.
On the machine without Docker it runs smoothly. On the machine with Docker it returns an OOM error after some time.

elkan1788 · May 17, 2024, 3:06am

I ran into the same problem, and in my case it is even worse: Ray can’t be installed with pip at all. It fails with the following errors:

ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

I then tried installing the Ray 3.0 whl package from a local file, but that didn’t succeed either. It printed this message:

ERROR: ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl is not a supported wheel on this platform.

I am on Windows 10 and run Docker under WSL. The Docker image is alpine3.19, and the Linux kernel is: Linux py3-travel 5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 Linux

What should I do next? I hope you can give me some suggestions.

Thanks.
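
A possible explanation for the "not a supported wheel on this platform" error above: manylinux wheels are built against glibc, while Alpine images use musl, so pip on Alpine rejects manylinux wheels. The Python sketch below shows how one might check this. The helper names are illustrative, and the libc detection is best effort only.

```python
import platform


def wheel_platform_tag(filename):
    """Return the platform tag, the last dash-separated field of a wheel file name."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename}")
    return filename[: -len(".whl")].split("-")[-1]


def targets_glibc(tag):
    """manylinux tags are built against glibc; musllinux tags target musl."""
    return tag.startswith("manylinux")


def running_libc():
    """Best-effort libc name; platform.libc_ver() may return '' when it finds no glibc."""
    name, _version = platform.libc_ver()
    return name or "unknown (possibly musl)"
```

If `targets_glibc(wheel_platform_tag(...))` is True for the wheel while `running_libc()` does not report glibc, switching to a glibc-based image, such as the python:3.10.9-slim-bullseye image used earlier in this thread, may be worth trying.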
