It works locally on my Mac, but once I try to run it inside a local Docker container I get the following:
A warning: WARNING services.py:1922 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.39gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
After the warning it says: INFO worker.py:1528 -- Started a local Ray instance.
A few seconds later I get this error: core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
I tried increasing the size of /dev/shm as the warning suggests, and it didn’t help.
I also tried limiting the number of CPUs in the init() call (as mentioned here).
Any ideas what I can do to solve it?
Thanks for your help.
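To double-check the /dev/shm point from the warning, here is a minimal sketch that compares the shared-memory size with the "more than 30% of available RAM" guideline quoted in the message. The function names are hypothetical, and `read_sizes` assumes a Linux container where `/dev/shm` is mounted; it reports the RAM the kernel sees, which may be larger than a cgroup memory limit set on the container.

```python
import os
import shutil


def shm_is_large_enough(shm_bytes, ram_bytes, fraction=0.30):
    """Return True if /dev/shm exceeds the given fraction of RAM.

    The 0.30 default comes from the Ray warning text ("more than 30%").
    """
    return shm_bytes > ram_bytes * fraction


def read_sizes(shm_path="/dev/shm"):
    """Read the current /dev/shm size and physical RAM (Linux only)."""
    shm_bytes = shutil.disk_usage(shm_path).total
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return shm_bytes, ram_bytes


# Example with the value from the warning: 67108864 bytes is 64 MiB,
# which is Docker's default /dev/shm size when --shm-size is not passed.
example_ok = shm_is_large_enough(67108864, 8 * 1024**3)
```

Running `shm_is_large_enough(*read_sizes())` inside the container would show whether the `--shm-size` value actually took effect. If /dev/shm still reports 67108864 bytes, the flag probably did not reach `docker run`.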
Could you share a Dockerfile so we can reproduce locally? It’s hard to tell what’s going on based on what you’ve provided here.
Generally, I recommend using the docker images we build for Ray instead of building your own, if possible. We run a lot of tests against them to make sure they’re ready for workloads.
ray stop; ray start --head --port=6379 --object-manager-port=8076 --port=9031 --no-monitor
I see that the raylet crashes. The logs (/tmp/ray/session_latest/logs/raylet.out) have this line:
[2022-12-27 04:21:01,838 E 97 168] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.
This indicates that the dashboard agent is the root cause of the failure. dashboard_agent.log doesn’t have anything interesting, but dashboard.log does:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/node/node_head.py", line 317, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/usr/local/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.2:44953: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.2:44953: Failed to connect to remote host: Connection refused {created_time:"2022-12-27T04:45:13.329170763+00:00", grpc_status:14}"
I’m not sure if this is the root cause or a red herring…
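Since the clue here was spread over several files (raylet.out, dashboard_agent.log, dashboard.log), a small script can help collect the suspicious lines in one pass. This is a rough sketch: the line format is inferred only from the single raylet line quoted above, treating `F` as a fatal level is an assumption, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as "[2022-12-27 04:21:01,838 E 97 168] (raylet) ..."
# and captures the one-letter level that follows the timestamp.
LEVEL_RE = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ([A-Z]) ")


def is_suspicious(line):
    """Flag error-level lines (E, and F by assumption) and Python tracebacks."""
    match = LEVEL_RE.match(line)
    if match:
        return match.group(1) in ("E", "F")
    return line.startswith("Traceback (most recent call last)")


def scan_logs(log_dir="/tmp/ray/session_latest/logs"):
    """Return (file name, line number, text) for every flagged line."""
    hits = []
    for path in sorted(Path(log_dir).glob("*")):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if is_suspicious(line):
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Calling `scan_logs()` inside the container right after the crash lists the flagged lines per file, which makes it easier to see which component failed first.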
Hi, are there any updates on this?
I’ve also tried running the same code on two different machines with similar hardware.
On one machine I don’t use Docker, and it runs smoothly; on the other I use Docker, and it fails with an OOM error after some time.
I also ran into the same problem, and in my case it is even worse: Ray could not be installed with pip at all. It fails with error messages like the ones below:
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray
I then tried installing the Ray 3.0 wheel from a local file, but that was not successful either. It printed this message:
ERROR: ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl is not a supported wheel on this platform.
I am on Windows 10 and run Docker under WSL. The Docker image is alpine3.19, and the Linux kernel is: Linux py3-travel 5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 Linux
What should I try next? I hope you can give me some suggestions.
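One likely explanation for the wheel error: Alpine uses musl libc, while `manylinux` wheels are built against glibc, so pip on Alpine only accepts `musllinux` (or pure-Python) wheels. The sketch below splits a wheel filename into its tags to show which libc family it targets. The helper names are hypothetical, and the parsing assumes the standard `name-version[-build]-python-abi-platform.whl` layout.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 when an optional build tag is present
        raise ValueError("unexpected wheel filename layout: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def libc_family(platform_tag):
    """Classify a platform tag as 'musl', 'glibc', or 'other'."""
    tags = platform_tag.split(".")  # compressed tag sets are dot-separated
    if any(tag.startswith("musllinux") for tag in tags):
        return "musl"
    if any(tag.startswith("manylinux") for tag in tags):
        return "glibc"
    return "other"


info = parse_wheel_filename("ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl")
family = libc_family(info["platform"])
```

Here `family` comes out as `glibc`, which does not match an Alpine container. The "from versions: none" message would also be consistent with no musl-compatible Ray wheel being available. Running `pip debug --verbose` inside the container lists the tags pip accepts there. Switching to a glibc-based image (for example a Debian-based Python image, or the prebuilt Ray images recommended earlier in this thread) is probably the simplest next step.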