1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.8.1 (I'm fairly sure this problem is not specific to this version)
- Python version: 3.8.13
- OS: Ubuntu 22.04 (Host) / Docker Base Image “rayproject/ray:2.8.1”
- Cloud/Infrastructure: Local Network, Docker Swarm network
- Other libs/tools (if relevant): Docker Swarm
3. What happened vs. what you expected:
- Expected: We are trying to set up a reusable RL training system that uses Docker Swarm to quickly distribute the environment and Ray under the hood for scaling and parallelization. We expect that when an RLlib training runs with `placement_strategy="SPREAD"`, the different nodes are utilized without errors.
More specifically, our setup is as follows:
- Create a virtual overlay network for the cluster:
```
docker network create --subnet 172.20.0.0/16 --gateway 172.20.0.1 net
docker network create --driver overlay my_overlay_network
```
- Create the head node:
```
docker service create \
  --name ray-head \
  --network my_overlay_network \
  --publish 6006:6006 \
  --publish 6379:6379 \
  --publish 8265:8265 \
  --publish 8000:8000 \
  --publish 9091:9091 \
  --publish 10001:10001 \
  --mount type=bind,source=/share/nfsdata/ray,target=/mnt/ray \
  --constraint 'node.labels.role == ray-head' \
  --generic-resource 'gpu=1' \
  --no-resolve-image \
  ray_cluster_281 \
  sh -c "ray start --head --port=6379 --dashboard-host=0.0.0.0 --include-dashboard=true && \
         /workspace/start.sh & \
         tail -f /dev/null"
```
Note that we set up an NFS shared storage directory (mounted at `/mnt/ray`) for all nodes.
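As a quick sanity check against the running cluster, we can confirm that every node actually sees the share as a writable directory. This is a minimal sketch, not part of the setup itself; the task count is arbitrary:
```python
# Sanity-check sketch: confirm the NFS share at /mnt/ray is visible and
# writable from every node in the Ray cluster. Assumes the cluster is up.
import os
import socket

import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def check_mount():
    # Report which host the task landed on and whether /mnt/ray is usable.
    return socket.gethostname(), os.path.isdir("/mnt/ray"), os.access("/mnt/ray", os.W_OK)

# Launch more tasks than any single node has CPUs so they spread across nodes.
print(ray.get([check_mount.remote() for _ in range(16)]))
```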
- Connect the worker nodes through the Docker Python API:
```python
ray_start_command = f"ray start --address='{ip}:6379'"
client.services.create(
    image="my-ray-cluster:latest",
    name=f'ray-worker-{device.id}',
    networks=["my_overlay_network"],
    command=["sh", "-c", ray_start_command],
    constraints=[f'node.id == {node_id}'],
    mounts=[{
        "Source": "/workspace/ray",
        "Target": "/mnt/ray",
        "Type": "bind"
    }]
)
```
- Submit the RL training task through the `JobSubmissionClient` Python API (a sketch follows below).
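For context, a minimal sketch of how such a submission looks; the dashboard address, `working_dir`, and entrypoint script name below are placeholders, not our exact values:
```python
# Job-submission sketch. The address, working_dir and entrypoint are
# placeholders, not the exact values from our setup.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node-ip>:8265")

job_id = client.submit_job(
    entrypoint="python train_ppo.py",    # the RLlib training script
    runtime_env={"working_dir": "./"},   # shipped to the cluster by Ray
)
print("Submitted job:", job_id)
```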
- Actual: We can successfully create the Ray cluster, and all nodes show up on the `head_node:8265` dashboard. But when the RLlib training task executes, loads of `OwnerDiedError` messages flood the logs, the training is unstable, and it crashes after roughly 30 minutes to 1 hour.
This can be reproduced with even the simplest training script:
```python
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune, train

if __name__ == "__main__":
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .framework("torch")
        .training(_enable_learner_api=False)
        .rollouts(num_rollout_workers=8)
        .resources(num_gpus=1, placement_strategy="SPREAD")
        .rl_module(_enable_rl_module_api=False)
    )
    tuner = tune.Tuner(
        "PPO",
        param_space=config.to_dict(),
        run_config=train.RunConfig(
            storage_path="/mnt/ray",  # NFS directory
            checkpoint_config=train.CheckpointConfig(checkpoint_at_end=True),
            stop={"training_iteration": 1000},
        ),
    )
    results = tuner.fit()
```
4. What we have tried:
- We first suspected a networking issue, as suggested by previous posts such as "[Serve] The `ray start --head --node-ip-address ip` is not working correctly in Docker. And it's not clear which ports to open - Ray Serve - Ray". However, publishing all ports between `--min-worker-port` and `--max-worker-port` does not help; the problem still exists (see the sketch after this list).
- We tried starting these containers manually (with `docker run`) and set `--network=host`. With every container effectively behaving like a physical machine, the problem disappeared. This further suggests that the problem is network-related, but I don't know what exactly.
- Another observation: even with `--network=host`, the task balancing doesn't seem to be even. The head node uses 8 CPUs and the worker node only 1 CPU. But perhaps this is because our head node's spec is significantly better than the worker node's.
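For reference, a sketch of the kind of port pinning and publishing we experimented with for the workers; the specific ports, the worker-port range, and the `endpoint_spec` below are illustrative rather than our verbatim configuration:
```python
# Sketch: start a worker with Ray's normally-random ports pinned to fixed
# values and publish them through the Swarm service. All port numbers and the
# worker-port range here are illustrative placeholders.
import docker

client = docker.from_env()
head_ip = "172.20.0.2"  # placeholder for the head node's address

ray_start_command = (
    f"ray start --address='{head_ip}:6379' "
    "--node-manager-port=6380 --object-manager-port=6381 "
    "--min-worker-port=11000 --max-worker-port=11050"
)

client.services.create(
    image="my-ray-cluster:latest",
    name="ray-worker-pinned",
    networks=["my_overlay_network"],
    command=["sh", "-c", ray_start_command],
    # Publish the pinned ports so they are reachable from outside the overlay.
    endpoint_spec=docker.types.EndpointSpec(
        ports={p: p for p in (6380, 6381, *range(11000, 11051))}
    ),
)
```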
