I am using a Ray cluster on AWS EKS, configured so that the /tmp/ray directory is mounted from the host rather than being an emptyDir. I was getting this in my node logs → Failed to connect to socket at address: /tmp/ray/session_2025-10-13_04-08-58_687729_1/sockets/raylet.3
Since this is a host-mounted path, two or more worker pods running on the same node will share this folder. Can that cause this issue (i.e., can multiple workers use the same temp_dir)?
I wanted to check whether this could be the cause of the error above, or whether there is some other underlying problem.
Note: I am using this setup to ship my logs to Datadog via the dd-agent, which is mounted to the same path on the same node.
1. Severity of the issue: (select one)
Medium: Significantly affects my productivity, but I can find a workaround.
Yes, sharing the same /tmp/ray directory (hostPath) among multiple Ray pods on the same node can cause the “Failed to connect to socket” error. Ray expects each node (and thus each pod, if running multiple per node) to have an isolated temp directory, because it creates Unix domain sockets and other files that cannot be shared safely. When multiple pods share the same temp dir, they can overwrite each other’s socket files, leading to connection failures and instability. This is a known issue and has been reported by multiple users running Ray on shared filesystems or with hostPath mounts for /tmp/ray (see here, here).
The recommended solution is to ensure each Ray pod uses a unique, node-local temp directory (e.g., an emptyDir volume in Kubernetes) rather than a shared hostPath. If you need to collect logs, consider mounting a separate log directory for that purpose, rather than sharing the entire Ray temp directory. Would you like more detail on how to set this up in Kubernetes?
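To make the recommendation concrete, here is a minimal sketch of a pod spec fragment that replaces the hostPath mount with an emptyDir (the container and volume names are illustrative assumptions, not taken from your setup):

```yaml
# Sketch: per-pod Ray temp dir via emptyDir instead of hostPath.
# Each pod gets its own node-local scratch space, so the socket
# files under /tmp/ray/session_*/sockets/ are never shared
# between pods scheduled on the same node.
spec:
  containers:
    - name: ray-worker              # illustrative container name
      image: rayproject/ray:latest
      volumeMounts:
        - name: ray-tmp
          mountPath: /tmp/ray       # Ray's default temp/session dir
  volumes:
    - name: ray-tmp
      emptyDir: {}                  # node-local, unique per pod
```

With this in place, each pod's raylet creates its sockets in a directory no other pod can see, which avoids the "Failed to connect to socket" collisions.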
@RunLLM Is there a way to give each pod's temp dir a custom, unique location, such as using the pod name in the folder path? I have tried passing the temp_dir env var on all pods with the pod name appended, but all the worker pods used the path supplied to the head pod instead of their own.
Is it mandatory to use the same path for all?
Ray currently mandates that all nodes in a cluster (head and workers) use the same temp_dir path, which is set by the head node and propagated to workers. Even if you set the RAY_TMPDIR environment variable or pass a custom path on worker pods, Ray will override it with the head node’s temp_dir (see discussion, code reference). This is why your attempt to use pod-specific directories did not work.
There is no supported way to have each pod use a unique temp_dir path in a standard Ray cluster; all nodes must use the same path. This is a known limitation and has been discussed as a pain point for multi-pod-per-node or shared filesystem setups. Would you like more detail on possible workarounds or alternative log collection strategies?
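Since per-pod temp dirs are not supported, one common workaround is to ensure at most one Ray worker pod is scheduled per node, so a hostPath-mounted /tmp/ray is never shared. A hedged sketch using pod anti-affinity (the `app: ray-worker` label is an assumption about how your worker pods are labeled):

```yaml
# Sketch: schedule at most one Ray worker pod per node, so even a
# hostPath-mounted /tmp/ray is only ever used by a single pod.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: ray-worker       # assumed worker pod label
          topologyKey: kubernetes.io/hostname
```

This trades scheduling flexibility for isolation; if you rely on packing multiple worker pods per node, the emptyDir approach is usually the better fix.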
I guess the user used the --temp-dir option of ray start to change the temp dir path of the worker node. If so, the reason only the head node's path could be changed but the worker node's could not is that the --temp-dir option is ignored when starting a worker node. See the source code.
In this case, the worker node gets ray_params with temp_dir = None. Then, when the worker node initializes, it calls self._init_temp() to determine the temp dir. Since ray_params.temp_dir was set to None, the worker always fetches the temp dir from the head node's GCS key-value store; see the source code. As a result, the temp dir path of every worker node ends up the same as the head node's.
To change the temp dir of a worker node directly, you can export the environment variable RAY_TMPDIR. For example:
# start head node
ray start --head --temp-dir /tmp/ray-head ...
or
export RAY_TMPDIR=/tmp/ray-head && ray start --head ...
# start worker node
export RAY_TMPDIR=/tmp/ray-worker
ray start --address $head_ip
No, the above approach will not work as intended. Setting RAY_TMPDIR or using --temp-dir on worker nodes is ignored; worker nodes always use the temp_dir specified by the head node. This is confirmed in the source code and discussion: “--temp-dir is only configurable from a head node. All the worker nodes will use the same temp_dir as a head node.” Even if you export RAY_TMPDIR on the worker, Ray will override it with the head node’s temp_dir.
Would you like more detail on why this is enforced or on possible workarounds?
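If the only reason for the hostPath mount is log shipping, one workaround is to keep /tmp/ray on an emptyDir and share that same emptyDir with a Datadog agent sidecar inside each pod, so logs stay per-pod and no path is shared between pods. A hedged sketch (the sidecar container name, image tag, and overall agent wiring are assumptions; a full dd-agent setup needs more configuration than shown here):

```yaml
# Sketch: pod-local log collection. Ray writes to an emptyDir;
# a dd-agent sidecar in the same pod reads logs from it, so no
# hostPath is shared between pods on the same node.
spec:
  containers:
    - name: ray-worker
      image: rayproject/ray:latest
      volumeMounts:
        - name: ray-tmp
          mountPath: /tmp/ray
    - name: datadog-agent             # assumed sidecar setup
      image: gcr.io/datadoghq/agent:7
      volumeMounts:
        - name: ray-tmp
          mountPath: /tmp/ray
          readOnly: true              # sidecar only reads logs
  volumes:
    - name: ray-tmp
      emptyDir: {}
```

Alternatively, a node-level dd-agent can tail container stdout/stderr instead of files under /tmp/ray, which sidesteps the shared-directory problem entirely.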