High Memory Usage

Collin_Phillips · September 7, 2022, 3:56pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

I upgraded my development system around the time Ray 2.0 came out. Now that I’m getting back into my workflow and using the new hardware, I’m noticing that my runs use more memory than before. Just by following the fractional_gpu example on GitHub (mostly; code pasted below), my system completely maxes out on RAM and VRAM. It still runs in this case, but my actual application quickly hits OOM errors. It seems like the training process shouldn’t require this many resources.

from ray import air, tune
from ray.tune.registry import register_env
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from pettingzoo.mpe import simple_spread_v2

# Based on code from github.com/parametersharingmadrl/parametersharingmadrl

if __name__ == "__main__":

    register_env("simple_spread", lambda _: PettingZooEnv(simple_spread_v2.env()))

    tune.Tuner(
        "PPO",
        run_config=air.RunConfig(
            stop={"episodes_total": 60000},
            checkpoint_config=air.CheckpointConfig(
                checkpoint_frequency=10,
            ),
        ),
        param_space={
            # Environment specific.
            "env": "simple_spread",

            # General
            "framework": "torch",
            "num_gpus": 0.001,
            "num_workers": 20,
            "num_gpus_per_worker": (1-0.001)/21,
            "num_envs_per_worker": 1,
            "compress_observations": True,

            # Algorithm Specific
            "lambda": 0.99,
            "train_batch_size": 512,
            "sgd_minibatch_size": 32,
            "num_sgd_iter": 5,
            "batch_mode": "truncate_episodes",
            "entropy_coeff": 0.01,
            "lr": 2e-5,

            # Multiagent
            "multiagent": {
                "policies": {"shared_policy"},
                "policy_mapping_fn": (
                    lambda agent_id, episode, **kwargs: "shared_policy"
                ),
            },
        },
    ).fit()

arturn · September 8, 2022, 2:50pm

Hi @Collin_Phillips ,

Thanks for providing a concise problem description.
I have reproduced this on an AWS g3 instance.
The num_gpus_per_worker ends up being 0.047571428571429, though.
This might just be a rounding issue. Have you tried a slightly more conservative share, like (1-0.001)/22?
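For reference, the shares in question can be checked with plain arithmetic. This is only a rough sketch: it makes no Ray calls, `gpu_budget` is a hypothetical helper name, and it assumes the total request is simply the driver share plus one share per worker. It says nothing about how Ray itself rounds or schedules these values.

```python
# Sanity check of the fractional GPU shares from the posted config.
# Pure arithmetic; nothing here calls Ray or RLlib.
NUM_WORKERS = 20      # "num_workers" in the posted config
DRIVER_SHARE = 0.001  # "num_gpus" in the posted config


def gpu_budget(divisor):
    """Hypothetical helper: return (per-worker share, total requested GPU).

    Assumes total = driver share + NUM_WORKERS * per-worker share.
    """
    per_worker = (1 - DRIVER_SHARE) / divisor
    total = DRIVER_SHARE + NUM_WORKERS * per_worker
    return per_worker, total


# Compare the original divisor (21) with the more conservative one (22).
for divisor in (21, 22):
    per_worker, total = gpu_budget(divisor)
    print(f"divisor={divisor}: per_worker={per_worker:.6f}, total={total:.6f}")
```

Under that assumption, both divisors request less than one full GPU in total, so the arithmetic alone would not oversubscribe the device.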

I have opened an issue to track this.

Cheers

Collin_Phillips · September 8, 2022, 4:01pm

Thanks for the response. I just tried it again with (1-0.001)/22 and saw the same resource usage behavior.

For comparison, the same config without fractional GPUs (i.e. num_gpus=1) uses a fraction of the resources.

Would it be best to move the discussion to the GitHub issue from here?

arturn · September 8, 2022, 5:00pm

Thanks for trying this out. And yes, let’s move this to GH.
