Tensor values in static graph mode

I am trying to build a complex exploration algorithm for RLlib. Because I add an exploration loss to the policy loss, I have a code section where the sample batch contains Tensors rather than arrays. In static graph mode these Tensors cannot be evaluated in that section, although I need their values for metrics.

I already tried wrapping my function in @make_tf_callable, but that does not help: it also needs a feed_dict, which is not available at that point. Furthermore, in eager_tracing mode the policies have no graph available. I guess that in this case functions have to be wrapped in tf.function().

See my PR for an example.

Is there any way to evaluate these Tensors?

Feel free to use the following script to run an example from my branch:

import os
import ray
from ray import tune
from ray.rllib.algorithms.ppo import ppo
from ray.rllib.utils.exploration.callbacks import RNDMetricsCallbacks

config = (
    ppo.PPOConfig()
    .environment(
        env="FrozenLake-v1",
    )
    .framework(
        framework="tf",
        # Switch eager tracing on to see that no session is available
        # in this mode.
        # eager_tracing=True,
    )
    .training(
        num_sgd_iter=8,
    )
    .rollouts(
        num_envs_per_worker=4,
        num_rollout_workers=0,
    )
    .debugging(
        log_level="DEBUG",
        seed=2,
    )
    .exploration(
        exploration_config={
            "type": "RND",
            "embed_dim": 64,
            "lr": 0.0001,
            "intrinsic_reward_coeff": 0.005,
            "nonepisodic_returns": True,
            "sub_exploration": {
                "type": "StochasticSampling",
            },
        },
    )
    # .callbacks(RNDMetricsCallbacks)
)

ray.init(ignore_reinit_error=True, local_mode=True)
# Trace TensorFlow, if needed.
# os.mkdir("/tmp/tf_timeline_test")
# os.environ["TF_TIMELINE_DIR"] = "/tmp/tf_timeline_test"

algorithm = config.build()
for i in range(10):
    print(f"========================================{i}===========================================")
    algorithm.train()

ray.shutdown()

Maybe @sven1977 or Jun Gong has an answer to this? (I know you both work with TensorFlow.)

@Lars_Simon_Zehnder A SampleBatch should ideally never contain a differentiable component if your code needs to work in distributed mode. If a SampleBatch contains any tensor with gradient info, it will lose that info as it goes through the object store. So I am curious why you ended up with a Tensor in your SampleBatch to begin with. For these types of exploration methods, can’t you just add the value of the intrinsic reward to the extrinsic one during the postprocess_trajectory call?
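
A minimal sketch of that idea, assuming the batch is a plain dict of NumPy arrays with a "rewards" key and that the RND intrinsic reward is the per-step squared error between a target embedding and a predictor embedding. The helper add_intrinsic_reward and its arguments are hypothetical stand-ins, not RLlib API; only the name intrinsic_reward_coeff comes from the config above.

```python
import numpy as np


def add_intrinsic_reward(batch, target_embed, predictor_embed,
                         intrinsic_reward_coeff=0.005):
    """Return a copy of `batch` whose rewards include a scaled intrinsic bonus.

    Hypothetical helper: `batch` is a plain dict of NumPy arrays standing in
    for a sample batch; both embeddings have shape (T, embed_dim).
    """
    # RND-style novelty signal: per-step mean squared error between embeddings.
    intrinsic = np.mean((target_embed - predictor_embed) ** 2, axis=-1)
    out = dict(batch)
    # Everything stays a NumPy array, so nothing differentiable is stored.
    out["rewards"] = batch["rewards"] + intrinsic_reward_coeff * intrinsic
    out["intrinsic_rewards"] = intrinsic
    return out


batch = {"rewards": np.array([0.0, 0.0, 1.0])}
target = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]])
predictor = np.array([[0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
new_batch = add_intrinsic_reward(batch, target, predictor,
                                 intrinsic_reward_coeff=0.5)
print(new_batch["rewards"])
```

Because the bonus is computed on arrays before the batch reaches the loss, the values can be logged directly without a session or feed_dict.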


@kourosh thanks for your reply!

I took a look at the PPO value targets and advantages before I started to code this module. There, the same variables (for the first value head) are stored in the sample batch during postprocess_trajectory().
When the sample batch is passed into the policy’s loss() function, it already contains only Tensors, as the loss is then evaluated in a session run by RunBuilder.

What does work for evaluating the Tensors is a monkey patch on the policy’s stats_fn(): if I also add the intrinsic_value_loss there, the loss gets evaluated and can be seen in TensorBoard. However, this puts the metric under tune\evaluation instead of under custom_metrics/rnd/, where it should reside so that the metrics from the policy are clearly separated from the ones from the exploration module.
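
For reference, the kind of monkey patch I mean looks roughly like the sketch below. DummyPolicy is a hypothetical stand-in for the real policy class, patch_stats_fn is a made-up helper name, and the stat values are placeholders; the only assumption about the policy is that it exposes a stats_fn() that takes the train batch and returns a dict.

```python
import types


class DummyPolicy:
    """Hypothetical stand-in for a policy that exposes stats_fn(train_batch)."""

    def stats_fn(self, train_batch):
        return {"policy_loss": train_batch["policy_loss"]}


def patch_stats_fn(policy, extra_stats_fn):
    """Wrap policy.stats_fn so it also reports the stats from extra_stats_fn."""
    original = policy.stats_fn  # bound method, captured before patching

    def patched(self, train_batch):
        stats = dict(original(train_batch))
        stats.update(extra_stats_fn(train_batch))
        return stats

    # Patch this instance only; other policies keep their own stats_fn.
    policy.stats_fn = types.MethodType(patched, policy)
    return policy


policy = patch_stats_fn(
    DummyPolicy(),
    lambda train_batch: {
        "intrinsic_value_loss": train_batch["intrinsic_value_loss"]
    },
)
stats = policy.stats_fn({"policy_loss": 0.3, "intrinsic_value_loss": 0.1})
print(stats)
```

The extra key ends up in the same dict as the policy's own stats, which is exactly why it is reported next to the policy metrics and not in a separate namespace.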

I hope this clarifies a little more where this setup came from.