Actions created by Policy being modified before input to environment

henry_lei · March 6, 2023, 11:16pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Hi,

I’m trying to set up a multiagent training scenario where there are some learned policies training alongside with predefined heuristic policies, akin to the examples in “rock_paper_scissors_multiagent” and “multiagent_custom_policy” and “multiagent_different_spaces…”. I’ve created a hand-made policy that looks like it’s working independently, but when I try to incorporate it into training, the actions that are being passed into my environment don’t match up. For example, if the action the policy spits out is (5, -10, 5) the action being passed into my environment would be (10000, -10000, 10000), which corresponds to the box action space bounds that I’ve defined. Any one know what might be the issue?

Using: Ray 2.0.0, Windows, R2D2 as training algorithm

Thanks!

Henry

henry_lei · March 8, 2023, 11:39pm

Upon some further testing, it seems this only occurs when I define continuous action spaces. I tested my training set up with the RandomPolicy example, but I’m getting the same issue. I’ve tried specifying the action distribution as “deterministic” from the TF_action_distribtions, but it doesn’t work. Here’s how I set up my Policies in the policy map:

PolicySpec(policy_class=RandomPolicy, observation_space=obs_space_low,
action_space=act_space_low, config={“model”: {“custom_action_dist”: Deterministic}})

mannyv · March 9, 2023, 2:46am

Hi @henry_lei,

I think this is the setting you are looking for.

github.com

ray-project/ray/blob/16b2963f1177b207a64ae38716c694e1f524ea4f/rllib/algorithms/algorithm.py#L1440


      
              policy_id: Policy to query (only applies to multi-agent).
                  Default: "default_policy".
              full_fetch: Whether to return extra action fetch results.
                  This is always set to True if `state` is specified.
              explore: Whether to apply exploration to the action.
                  Default: None -> use self.config.explore.
              timestep: The current (sampling) time step.
              episode: This provides access to all of the internal episodes'
                  state, which may be useful for model-based or multi-agent
                  algorithms.
              unsquash_action: Should actions be unsquashed according to the
                  env's/Policy's action space? If None, use the value of
                  self.config.normalize_actions.
              clip_action: Should actions be clipped according to the
                  env's/Policy's action space? If None, use the value of
                  self.config.clip_actions.
          
          
Keyword Args:
              kwargs: forward compatibility placeholder
          
          
Returns:

henry_lei · March 15, 2023, 8:51pm

That was it, thanks!

Jules_Damji · March 15, 2023, 9:05pm

looks like this worked.

Topic		Replies	Views
Training with a random policy Configure Algorithm, Training, Evaluation, Scaling	11	955	November 11, 2022
Where get_action_dist() is getting called? RLlib	4	36	October 30, 2024
Multiagent PPO with custom model gives actions that are outside of the action space RLlib	2	353	October 5, 2021
Training a custom agent against a random policy RLlib	0	215	November 14, 2023
Is any multi discrete action example for PPO or other algorithms? RLlib	9	4356	January 29, 2023

Actions created by Policy being modified before input to environment

Related topics