PPO torch vs tf2

NDR008 · May 21, 2023, 7:21pm

Hi,

I would prefer to use tf2 because it is what I am more familiar with.
However, in my setup PPO runs well with torch, but when I run it with tf2, the workers / environment stall every so often.

I am guessing this happens when the learner performs its update.
Is there any way to split the resources used for learning and for the workers (assuming that is the issue)?

At the moment I have 16 CPU threads and 1 GPU.
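
A minimal sketch of how such a split could be budgeted on a 16-thread, 1-GPU machine. The helper `plan_resources` is hypothetical, and the key names (`num_rollout_workers`, `num_cpus_per_worker`, `num_cpus_for_local_worker`, `num_gpus`, `num_gpus_per_worker`) are assumed to match the rollout and resource settings of the RLlib version in use; these names have changed between releases, so check them against the installed version. Whether this removes the stall is not established in this thread.

```python
def plan_resources(total_cpus, total_gpus, cpus_for_learner=2, cpus_per_worker=1):
    """Hypothetical helper: reserve CPUs for the learner, give the rest to rollout workers."""
    if cpus_for_learner >= total_cpus:
        raise ValueError("no CPUs left for rollout workers")
    # Every CPU not reserved for the learner is handed to a rollout worker.
    num_workers = (total_cpus - cpus_for_learner) // cpus_per_worker
    return {
        # Key names are assumptions about RLlib's config; verify for your version.
        "num_rollout_workers": num_workers,
        "num_cpus_per_worker": cpus_per_worker,
        "num_cpus_for_local_worker": cpus_for_learner,
        "num_gpus": total_gpus,       # the learner gets the whole GPU
        "num_gpus_per_worker": 0,     # rollout workers stay on CPU
    }


plan = plan_resources(total_cpus=16, total_gpus=1)
print(plan)
```

The idea is only that the learner and the rollout workers never compete for the same CPU threads or for the GPU; the two reserved CPUs are an arbitrary starting point.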

Jules_Damji · May 23, 2023, 8:13pm

@NDR008 Thanks for posting. Pinging people from the RLlib team

cc: @arturn @gjoliver @kourosh @avnishn @Rohan138

NDR008 · May 23, 2023, 10:03pm

I’ll post a more detailed update on this later, but here is what I’ve observed so far:
PPO, torch vs tf: torch is fine, tf shows the stalling I mentioned earlier.
A3C: torch is fine, tf crashes.

I have changed from a 1650 Super 4GB GPU to a 3090 24GB, but no improvement.

I’m starting to wonder whether it is a TensorFlow version issue (I’m on 2.6) or the way I’m configuring resources / learners / rollout workers.

kourosh · May 24, 2023, 3:17pm

@NDR008 I think you might be on to something regarding the TensorFlow version issue. We always run release tests on both torch and tf for PPO, and they run fine.
Here are our release requirements files:

github.com

ray-project/ray/blob/master/python/requirements/ml/requirements_dl.txt

# These requirements are used for the CI and CPU-only Docker images so we install CPU only versions of torch.
# For GPU Docker images, you should install requirements_ml_docker.txt afterwards.

tensorflow==2.11.0; sys_platform != 'darwin' or platform_machine != 'arm64'
tensorflow-macos==2.11.0; sys_platform == 'darwin' and platform_machine == 'arm64'
tensorflow-probability==0.19.0

# If you make changes below this line, please also make the corresponding changes to `requirements_ml_docker.txt`
# and to `install-dependencies.sh`!

--extra-index-url https://download.pytorch.org/whl/cpu  # for CPU versions of torch, torchvision
--find-links https://data.pyg.org/whl/torch-1.13.0+cpu.html  # for CPU versions of torch-scatter, torch-sparse, torch-cluster, torch-spline-conv
torch==1.13.0
torchvision==0.14.0
torch-scatter==2.1.0
torch-sparse==0.6.16
torch-cluster==1.6.0
torch-spline-conv==1.2.1
torch-geometric==2.1.0

github.com

ray-project/ray/blob/master/python/requirements/ml/requirements_rllib.txt

-r requirements_dl.txt

# Environment adapters.
# ---------------------
# Atari
# TODO(sven): Still needed for Atari (need to be wrapped by gymnasium as it does NOT support Atari yet)
gym==0.26.2
gymnasium[atari,mujoco]==0.26.3
# For testing MuJoCo envs with gymnasium.
mujoco-py<2.2,>=2.1
# Kaggle envs.
kaggle_environments==1.7.11
# Unity3D testing
# TODO(sven): Add this back to requirements_rllib.txt once mlagents no longer pins torch<1.9.0 version.
#mlagents==0.28.0
mlagents_envs==0.28.0
# For tests on PettingZoo's multi-agent envs.
pettingzoo==1.22.1; python_version >= '3.7'
# When installing pettingzoo, chess is missing, even though its a dependancy
# TODO: remove if a future pettingzoo and/or ray version fixes this dependancy issue

This file has been truncated.

Can you make sure you are using these versions?
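
One way to check this quickly is sketched below, using only the standard library. The pins are copied from the `requirements_dl.txt` excerpt above; the helper names `installed_version` and `find_mismatches` are made up for illustration, and the comparison is a plain string match after dropping a local tag such as `+cpu`, not a full version-specifier check.

```python
from importlib import metadata

# Pins copied from the requirements_dl.txt excerpt above.
PINS = {
    "tensorflow": "2.11.0",
    "tensorflow-probability": "0.19.0",
    "torch": "1.13.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (installed, pinned)."""
    mismatches = {}
    for package, pinned in pins.items():
        found = lookup(package)
        # Drop a local version tag such as "+cpu" before comparing.
        if found is None or found.split("+")[0] != pinned:
            mismatches[package] = (found, pinned)
    return mismatches


for package, (found, pinned) in find_mismatches(PINS).items():
    print(f"{package}: installed {found}, release tests use {pinned}")
```

On a setup with TensorFlow 2.6, as described earlier in the thread, the `tensorflow` entry would be reported as a mismatch.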
