OwnerDiedError with Docker Swarm cluster

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.8.1 (I’m fairly sure this problem is not version-specific)
  • Python version: 3.8.13
  • OS: Ubuntu 22.04 (Host) / Docker Base Image “rayproject/ray:2.8.1”
  • Cloud/Infrastructure: Local Network, Docker Swarm network
  • Other libs/tools (if relevant): Docker Swarm

3. What happened vs. what you expected:

  • Expected: We are trying to set up a reusable RL training system, using Docker Swarm to quickly distribute the environment and Ray under the hood for scaling and parallelization. We expect that when running an RLlib training job with placement_strategy="SPREAD", different nodes should be utilized without errors.

More specifically, our setup is as follows:

  1. Create a virtual local network and the overlay network for the cluster.
docker network create --subnet 172.20.0.0/16 --gateway 172.20.0.1 net
docker network create --driver overlay my_overlay_network
  2. Create the head node.
docker service create \
   --name ray-head \
   --network my_overlay_network \
   --publish 6006:6006 \
   --publish 6379:6379 \
   --publish 8265:8265 \
   --publish 8000:8000 \
   --publish 9091:9091 \
   --publish 10001:10001 \
   --mount type=bind,source=/share/nfsdata/ray,target=/mnt/ray \
   --constraint 'node.labels.role == ray-head' \
   --generic-resource 'gpu=1' \
   --no-resolve-image \
   ray_cluster_281 \
   sh -c "ray start --head --port=6379 --dashboard-host=0.0.0.0 --include-dashboard=true && \
    /workspace/start.sh & \
    tail -f /dev/null"

Note that we set up NFS shared storage for all nodes.

  3. Connect worker nodes through the Docker Python API.
# `client` is a docker-py DockerClient (e.g. docker.from_env()); `ip`, `device`,
# and `node_id` come from our own node-discovery code.
ray_start_command = f"ray start --address='{ip}:6379'"

client.services.create(
    image="my-ray-cluster:latest",
    name=f'ray-worker-{device.id}',
    networks=["my_overlay_network"],
    command=["sh", "-c", ray_start_command],
    constraints=[f'node.id == {node_id}'],
    mounts=[{
        "Source": "/workspace/ray",
        "Target": "/mnt/ray",
        "Type": "bind"
    }]
)

  4. Submit the RL training task through the JobSubmissionClient Python API (a rough sketch is shown right below).
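
Concretely, the submission step looks roughly like the following sketch (the head address, script name, and working directory are illustrative, not our exact values):

from ray.job_submission import JobSubmissionClient

# Connect to the job server behind the head node's dashboard port (8265).
client = JobSubmissionClient("http://head-node-ip:8265")  # illustrative address

job_id = client.submit_job(
    entrypoint="python train_ppo.py",     # illustrative: the training script shown further below
    runtime_env={"working_dir": "./"},    # ship the local training code with the job
)
print(f"submitted {job_id}")
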
  • Actual: We can successfully create the Ray cluster, as can be seen on the head_node:8265 dashboard. But when the RLlib training task runs, the logs are flooded with OwnerDiedError, and training is unstable, usually crashing within 30 minutes to an hour.

This can be replicated using even the simplest training script:

from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune, train


if __name__ == "__main__":
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .framework("torch")
        .training(_enable_learner_api=False)
        .rollouts(num_rollout_workers=8)
        .resources(num_gpus=1, placement_strategy="SPREAD")
        .rl_module(_enable_rl_module_api=False)
    )

    tuner = tune.Tuner(
        "PPO",
        param_space=config.to_dict(),
        run_config=train.RunConfig(
            storage_path="/mnt/ray",  # NFS directory
            checkpoint_config=train.CheckpointConfig(checkpoint_at_end=True),
            stop={"training_iteration": 1000},
        ),
    )

    results = tuner.fit()

4. What we have tried:

  1. We first suspected a networking issue, as suggested by previous posts such as “[Serve] The ray start --head --node-ip-address ip is not working correctly in Docker. And it’s not clear which ports to open - Ray Serve - Ray”. However, publishing all ports in the --min-worker-port/--max-worker-port range did not help; the problem still exists.
  2. We tried manually starting these containers (with docker run) and setting --network=host. With every container behaving almost like a physical machine on the network, the problem disappeared. This further suggests a networking issue, but I don’t know what exactly (see the docker-py sketch after this list).
  3. Another thing is that, even with --network=host, task balancing does not seem even: the head node uses 8 CPUs while the worker node uses only 1. But perhaps this is because our head node’s spec is significantly better than the worker node’s.
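
For reference, the docker run / host-network variant from item 2 is roughly the following docker-py sketch (the image name, head IP, and mount paths are illustrative):

import docker

client = docker.from_env()
head_ip = "192.168.1.10"   # illustrative: the head machine's LAN IP

# Worker container on the host network: Ray's dynamically assigned worker
# ports need no explicit publishing, which is what made this variant stable.
client.containers.run(
    image="my-ray-cluster:latest",                        # illustrative image
    command=["sh", "-c", f"ray start --address='{head_ip}:6379' --block"],
    network_mode="host",                                  # same as --network=host
    detach=True,
    volumes={"/workspace/ray": {"bind": "/mnt/ray", "mode": "rw"}},
)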

Your issue is likely caused by networking and object-ownership problems in Ray when running in Docker Swarm with shared NFS storage. OwnerDiedError typically means the Python worker that created an object (or actor) has died or become unreachable, often due to network misconfiguration, container isolation, or improper volume sharing. Ray clusters in Docker require careful network and volume setup: sharing /tmp/ray or object-store directories between containers can cause instability, and overlay networks may not expose all required ports or allow the direct container-to-container communication Ray expects. Using --network=host resolves many of these issues because it removes network isolation, but host networking is more limited in Swarm mode and is not ideal for production. Also, Ray's placement_strategy="SPREAD" will only distribute tasks if resources and connectivity are correctly configured across nodes.

According to the Ray documentation, you must ensure that all required Ray ports are open and mapped, and that each node advertises its correct, externally reachable IP address using --node-ip-address. Avoid sharing /tmp/ray or object store directories between containers, as this can cause reference and ownership conflicts leading to OwnerDiedError. For Docker Swarm, you may need to use host networking or advanced network configuration to ensure full connectivity, but this is not always possible. If you see uneven task distribution, it may be due to resource reporting or node specs, but the primary instability is almost certainly due to networking and/or shared storage issues. See the FAQ for more on networking and node IP configuration: Ray Cluster FAQ.
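
To make the --node-ip-address / worker-port advice concrete, here is a hedged sketch of how the worker start command from step 3 could be built; the IPs and the port range are assumptions to adapt to your overlay network:

# Sketch only: each node advertises the IP that the other nodes can actually
# reach it on, and pins the worker port range so those ports can be opened.
head_ip = "10.0.1.2"   # assumed: head node's overlay-network IP
node_ip = "10.0.1.7"   # assumed: this worker's overlay-network IP

ray_start_command = (
    f"ray start --address='{head_ip}:6379' "
    f"--node-ip-address={node_ip} "
    "--min-worker-port=11000 --max-worker-port=11100"   # assumed port range
)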

TLDR Workaround: Use the host network for the Docker service group.

I’ve been looking into KubeRay for a day and have had no luck. It seems to me that KubeRay only exposes the minimum set of necessary ports (8265, 10001, etc.) and works fine. I’m not very familiar with k8s, but since both it and Docker are container-based ways of distributing workloads, I don’t think there should be much difference.