Making the selection of action itself "stochastic"

hridayns · September 29, 2022, 2:39pm

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello, I am not entirely sure about how this would work so any clarifications will help really. In my multi-agent environment where all the agents share a certain policy. I want some of the agents to stochastically choose to do “nothing” (or a specific manual action) while the other agents select the best action / exploratory action?

Or maybe can I change the exploratory action performed to randomly choose between a default action of my choosing and truly random action?

OR

Is it possible to code such a thing where compute_action will only stochastically be called, and if it not called a “default” action is performed by the agent?

Thank you again.

mannyv · September 30, 2022, 1:29pm

Hi @hridayns

I think the most straightforward way to attempt this is to create your own custom exploration class.

If you look at the snippet below you will see an example of how the EpsilonGreedy exploration class does it. It selects a random action with probability Epsilon and the greedy action with probability 1-epsilon. I think you could adapt that to fit your needs.

github.com

ray-project/ray/blob/master/rllib/utils/exploration/epsilon_greedy.py#L181-L197


      
          exploit_action = action_distribution.deterministic_sample()
          batch_size = q_values.size()[0]
          action_logp = torch.zeros(batch_size, dtype=torch.float)
          
          
# Explore.
          if explore:
              # Get the current epsilon.
              epsilon = self.epsilon_schedule(self.last_timestep)
              if isinstance(action_distribution, TorchMultiActionDistribution):
                  exploit_action = tree.flatten(exploit_action)
                  for i in range(batch_size):
                      if random.random() < epsilon:
                          # TODO: (bcahlit) Mask out actions
                          random_action = tree.flatten(self.action_space.sample())
                          for j in range(len(exploit_action)):
                              exploit_action[j][i] = torch.tensor(random_action[j])
                  exploit_action = tree.unflatten_as(

hridayns · September 30, 2022, 2:25pm

Hello @mannyv, thank you for your response. I am trying to override the StochasticSampling class since that seems to be closer to what I am describing. However, when I try to print TF values while running an evaluation with configuration:

.exploration(
  explore=True,
  exploration_config={'type': CustomExploration}
)

but then in the CustomExploration class:

.
.
.
def _get_tf_exploration_action_op(self, action_dist, timestep, explore):
        ts = self.last_timestep + 1

        stochastic_actions = tf.cond(
            pred=tf.convert_to_tensor(ts < self.random_timesteps),
            true_fn=lambda: (
                self.random_exploration.get_tf_exploration_action_op(
                    action_dist, explore=True
                )[0]
            ),
            false_fn=lambda: action_dist.sample(),
        )
        deterministic_actions = action_dist.deterministic_sample()
        
        print(deterministic_actions) # Does not print during each step in the evaluation run? Only executes once or twice?

        action = tf.cond(
            tf.constant(explore) if isinstance(explore, bool) else explore,
            true_fn=lambda: stochastic_actions,
            false_fn=lambda: deterministic_actions,
        )

        logp = tf.cond(
            tf.math.logical_and(
                explore, tf.convert_to_tensor(ts >= self.random_timesteps)
            ),
            true_fn=lambda: action_dist.sampled_action_logp(),
            false_fn=functools.partial(zero_logps_from_actions, deterministic_actions),
        )

        # Increment `last_timestep` by 1 (or set to `timestep`).
        if self.framework in ["tf2", "tfe"]:
            self.last_timestep.assign_add(1)
            return action, logp
        else:
            assign_op = (
                tf1.assign_add(self.last_timestep, 1)
                if timestep is None
                else tf1.assign(self.last_timestep, timestep)
            )
            with tf1.control_dependencies([assign_op]):
                return action, logp
.
.
.

The print(deterministic_actions) line does not execute every step of the evaluation run which is baffling? From my understanding, _get_tf_exploration_action_op should be called every step, but the behaviour I see is confusing - only prints 2 times before the start of the run? Any clue?

mannyv · September 30, 2022, 6:36pm

Hi @hridayns

That is probably because you are running with a tf static graph. You need to enable eager mode if you want to include print statements. Either that or use tf.print along with tf.control_dependencies.

hridayns · September 30, 2022, 6:53pm

I did enable eager tracing However, the issue is more the frequency of output rather than the contents of the output ! Shouldn’t it print something at every step? Not just at the beginning…two times?

hridayns · October 2, 2022, 11:18pm

Sorry, I am not able to make it output something at every step and it doesn’t make sense. Am I supposed to put it in some kind of callback function? Isn’t _get_tf_exploration_action_op the function that gets called at every single step for a tensorflow framework user?

mannyv · October 3, 2022, 11:43am

Hi @hridayns,

Have you tried running with the config option, framework="tf2"?

hridayns · October 3, 2022, 11:59am

Oh I thought I already had eager tracing on with eager_tracing=True. Turns out you also have to set the framework=“tf2” which produces the following stack trace:

2022-10-03 11:46:09,971	INFO worker.py:1518 -- Started a local Ray instance.
(PPO pid=4701) 2022-10-03 11:46:14,072	INFO algorithm.py:1861 -- Executing eagerly (framework='tf2'), with eager_tracing=False. For production workloads, make sure to set eager_tracing=True  in order to match the speed of tf-static-graph (framework='tf'). For debugging purposes, `eager_tracing=False` is the best choice.
(PPO pid=4701) 2022-10-03 11:46:14,074	INFO algorithm.py:354 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=4735) 2022-10-03 11:46:16,610	ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=4735, ip=192.168.0.42, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f5d183239b0>)
(RolloutWorker pid=4735)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 617, in __init__
(RolloutWorker pid=4735)     seed=seed,
(RolloutWorker pid=4735)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1790, in _build_policy_map
(RolloutWorker pid=4735)     merged_conf,
(RolloutWorker pid=4735)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/policy/policy_map.py", line 121, in create_policy
(RolloutWorker pid=4735)     _class = get_tf_eager_cls_if_necessary(policy_cls, merged_config)
(RolloutWorker pid=4735)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/utils/tf_utils.py", line 262, in get_tf_eager_cls_if_necessary
(RolloutWorker pid=4735)     "This policy does not support eager " "execution: {}".format(orig_cls)
(RolloutWorker pid=4735) ValueError: This policy does not support eager execution: <class 'ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF1Policy'>
2022-10-03 11:46:16,620	ERROR trial_runner.py:980 -- Trial PPO(nv-100_nl-3_nz-4_ls-1000.0_nb-1_bls-250.0_blp-750.0_bll-0_bdur-END_puf-1_ec-0.1_df-0.1): Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/execution/ray_trial_executor.py", line 989, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=4701, ip=192.168.0.42, repr=PPO)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 127, in __init__
    validate=trainer_config.get("validate_workers_after_construction"),
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 269, in add_workers
    self.foreach_worker(lambda w: w.assert_healthy())
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 391, in foreach_worker
    remote_results = ray.get([w.apply.remote(func) for w in self.remote_workers()])
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=4735, ip=192.168.0.42, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f5d183239b0>)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 617, in __init__
    seed=seed,
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1790, in _build_policy_map
    merged_conf,
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/policy/policy_map.py", line 121, in create_policy
    _class = get_tf_eager_cls_if_necessary(policy_cls, merged_config)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/utils/tf_utils.py", line 262, in get_tf_eager_cls_if_necessary
    "This policy does not support eager " "execution: {}".format(orig_cls)
ValueError: This policy does not support eager execution: <class 'ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF1Policy'>

During handling of the above exception, another exception occurred:

ray::PPO.__init__() (pid=4701, ip=192.168.0.42, repr=PPO)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
    super().__init__(config=config, logger_creator=logger_creator, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable/trainable.py", line 157, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/algorithm.py", line 443, in setup
    raise e.args[0].args[2]
ValueError: This policy does not support eager execution: <class 'ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF1Policy'>

2022-10-03 11:46:16,624	ERROR ray_trial_executor.py:104 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/execution/ray_trial_executor.py", line 94, in _post_stop_cleanup
    ray.get(future, timeout=0)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=4701, ip=192.168.0.42, repr=PPO)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 127, in __init__
    validate=trainer_config.get("validate_workers_after_construction"),
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 269, in add_workers
    self.foreach_worker(lambda w: w.assert_healthy())
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 391, in foreach_worker
    remote_results = ray.get([w.apply.remote(func) for w in self.remote_workers()])
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=4735, ip=192.168.0.42, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f5d183239b0>)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 617, in __init__
    seed=seed,
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1790, in _build_policy_map
    merged_conf,
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/policy/policy_map.py", line 121, in create_policy
    _class = get_tf_eager_cls_if_necessary(policy_cls, merged_config)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/utils/tf_utils.py", line 262, in get_tf_eager_cls_if_necessary
    "This policy does not support eager " "execution: {}".format(orig_cls)
ValueError: This policy does not support eager execution: <class 'ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF1Policy'>

During handling of the above exception, another exception occurred:

ray::PPO.__init__() (pid=4701, ip=192.168.0.42, repr=PPO)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
    super().__init__(config=config, logger_creator=logger_creator, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable/trainable.py", line 157, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/algorithm.py", line 443, in setup
    raise e.args[0].args[2]
ValueError: This policy does not support eager execution: <class 'ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF1Policy'>

(PPO pid=4701) 2022-10-03 11:46:16,617	ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=4701, ip=192.168.0.42, repr=PPO)
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 127, in __init__
(PPO pid=4701)     validate=trainer_config.get("validate_workers_after_construction"),
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 269, in add_workers
(PPO pid=4701)     self.foreach_worker(lambda w: w.assert_healthy())
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/worker_set.py", line 391, in foreach_worker
(PPO pid=4701)     remote_results = ray.get([w.apply.remote(func) for w in self.remote_workers()])
(PPO pid=4701) ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=4735, ip=192.168.0.42, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f5d183239b0>)
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 617, in __init__
(PPO pid=4701)     seed=seed,
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1790, in _build_policy_map
(PPO pid=4701)     merged_conf,
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/policy/policy_map.py", line 121, in create_policy
(PPO pid=4701)     _class = get_tf_eager_cls_if_necessary(policy_cls, merged_config)
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/utils/tf_utils.py", line 262, in get_tf_eager_cls_if_necessary
(PPO pid=4701)     "This policy does not support eager " "execution: {}".format(orig_cls)
(PPO pid=4701) ValueError: This policy does not support eager execution: <class 'ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF1Policy'>
(PPO pid=4701) 
(PPO pid=4701) During handling of the above exception, another exception occurred:
(PPO pid=4701) 
(PPO pid=4701) ray::PPO.__init__() (pid=4701, ip=192.168.0.42, repr=PPO)
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
(PPO pid=4701)     super().__init__(config=config, logger_creator=logger_creator, **kwargs)
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable/trainable.py", line 157, in __init__
(PPO pid=4701)     self.setup(copy.deepcopy(self.config))
(PPO pid=4701)   File "/usr/local/lib/python3.6/dist-packages/ray/rllib/algorithms/algorithm.py", line 443, in setup
(PPO pid=4701)     raise e.args[0].args[2]
(PPO pid=4701) ValueError: This policy does not support eager execution: <class 'ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF1Policy'>
Traceback (most recent call last):
  File "train.py", line 327, in <module>
    _main()
  File "train.py", line 319, in _main
    local_dir='~/devel/rllibsumoutils/pheromone-RL/pheromone-PPO/tune-results',
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 752, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO(nv-100_nl-3_nz-4_ls-1000.0_nb-1_bls-250.0_blp-750.0_bll-0_bdur-END_puf-1_ec-0.1_df-0.1)])

TLDR; ValueError: This policy does not support eager execution: <class 'ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF1Policy'>

hridayns · October 3, 2022, 12:34pm

So I am now importing ray.rllib.algorithms.ppo.ppo_tf_policy.PPOTF2Policy and the issue still seems to be that the frequency of output is inconsistent - only prints the first two times and then never prints per step… Also I do not understand the actual difference between the two… PPOTF2Policy vs PPOTF1Policy - is it only for eager execution?

mannyv · October 3, 2022, 12:37pm

@hridayns,

For debugging you want eager_tracing=False

hridayns · October 3, 2022, 12:39pm

Okay, thank you. that is now working

mannyv · October 3, 2022, 12:43pm

@hridayns,
Great! "Keep in mind that I have found tf to be much faster than tf2 in my own work so when you are done debugging you may want to set framework back to tf.

hridayns · October 3, 2022, 1:25pm

Ah, I see. Thank you!

Topic		Replies	Views
How to define a randomly action taking multi-agent team RLlib	0	260	June 8, 2023
Action Masking Model: Deterministic selection of the best action RLlib	0	27	August 11, 2024
All or nothing (Explore or sample) actions - correct for each step? RLlib	2	288	October 6, 2022
[rllib] Customized action distribution of probability matrices RLlib	1	320	November 9, 2022
Inconsistent actions from Algorithm.compute_single_action RLlib	3	414	June 14, 2023

Making the selection of action itself "stochastic"

Related topics