I would prefer to use tf2, because it is what I am more familiar with.
In my setup PPO generally seems to run well, but when I run it with tf2, every so often the interaction between the worker and the environment stalls.
I am guessing this happens while the learner is performing its learning step.
Is there any way to split the resources used for learning and for the workers (assuming that is the issue)?
At the moment I have 16 CPU threads and 1 GPU.
@NDR008 Thanks for posting. Pinging people from the RLlib team
@arturn @gjoliver @kourosh @avnishn @Rohan138
May 23, 2023, 10:03pm
Please wait for a more detailed update from me on this, because I’ve observed the following:
PPO torch vs TF: torch is fine, tf does the behaviour I mentioned earlier.
A3C: torch is fine, tf crashes.
I have changed from a 1650 Super 4GB GPU to a 3090 24GB, but no improvement.
I’m starting to wonder whether it is a TensorFlow version issue (I’m on 2.6) or the way I’m configuring resources / learners / rollout workers.
@NDR008 I think you might be on to something regarding the TensorFlow version issue. We always run release tests on both torch and tf for PPO, and they run fine.
Here are our release requirements files:
# These requirements are used for the CI and CPU-only Docker images so we install CPU only versions of torch.
# For GPU Docker images, you should install requirements_ml_docker.txt afterwards.
tensorflow==2.11.0; sys_platform != 'darwin' or platform_machine != 'arm64'
tensorflow-macos==2.11.0; sys_platform == 'darwin' and platform_machine == 'arm64'
# If you make changes below this line, please also make the corresponding changes to `requirements_ml_docker.txt`
# and to `install-dependencies.sh`!
--extra-index-url https://download.pytorch.org/whl/cpu # for CPU versions of torch, torchvision
--find-links https://data.pyg.org/whl/torch-1.13.0+cpu.html # for CPU versions of torch-scatter, torch-sparse, torch-cluster, torch-spline-conv
This file has been truncated.
# Environment adapters.
# TODO(sven): Still needed for Atari (need to be wrapped by gymnasium as it does NOT support Atari yet)
# For testing MuJoCo envs with gymnasium.
# Kaggle envs.
# Unity3D testing
# TODO(sven): Add this back to requirements_rllib.txt once mlagents no longer pins torch<1.9.0 version.
# For tests on PettingZoo's multi-agent envs.
pettingzoo==1.22.1; python_version >= '3.7'
# When installing pettingzoo, chess is missing, even though it's a dependency
# TODO: remove if a future pettingzoo and/or ray version fixes this dependency issue
Can you make sure you are using these versions?
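A quick way to check this, as a minimal sketch: the snippet below compares what is installed in the current environment against the two pins quoted above (`tensorflow==2.11.0` and `pettingzoo==1.22.1`). The `PINNED` dict and the `check_versions` helper are hypothetical names made up for illustration and are not part of Ray or RLlib. It uses only the standard library (`importlib.metadata`, Python 3.8+). On Apple silicon the package name would be `tensorflow-macos` rather than `tensorflow`.

```python
from importlib import metadata

# Pins copied from the requirements excerpt above.
# The dict itself is illustrative, not something RLlib provides.
PINNED = {"tensorflow": "2.11.0", "pettingzoo": "1.22.1"}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (installed_or_None, pinned, matches)} for each pin.

    `get_version` is injectable so the helper can be exercised without
    the packages actually being installed.
    """
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_versions(PINNED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} pinned={wanted} [{status}]")
```

A mismatch here only tells you that your environment differs from the one the release tests run in. It does not by itself confirm that the TensorFlow version is the cause of the stalls.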