How to write a trainable for tuning a deterministic policy?

Hi guys,

This sounds like a paradox, but it will become clear soon. I have a deterministic policy with a couple of hyperparameters that I want to use as a baseline in comparison to learning policies.

So far I have run the policy inside a while loop, calling compute_action on the agent and step on the environment, and receiving a reward after each episode played. I see that tune.run can tune hyperparameters, but it needs a trainable. How can I create such a trainable - do I have to program a Trainer class?
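Roughly, my current loop looks like this (a simplified sketch; agent and env are placeholders for my actual objects):

# Simplified sketch of my current evaluation loop (placeholder names).
obs = env.reset()
done = False
episode_reward = 0.0
while not done:
    action = agent.compute_action(obs)          # deterministic, rule-based policy
    obs, reward, done, info = env.step(action)  # gym-style step
    episode_reward += reward
print("episode reward:", episode_reward)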

@sven1977 @kai told me at Ray Summit to tag you here, so I do :slight_smile:

Best,

Simon

Hey @Lars_Simon_Zehnder,
Seems like you have:

  • An algorithm you would like to use to train your actual (learning) policy, e.g. PPO.
  • A policy that is frozen, but that you would like to evaluate against (inside some Trainer).

You should probably do it the way it's shown in this example script:
ray/rllib/examples/multi_agent_custom_policy.py

from ray import tune
from ray.rllib.examples.policy.random_policy import RandomPolicy

# obs_space / act_space are the observation and action spaces of your env.
config = {
    "multiagent": {
        "policies": {
            "ppo_policy": (None, obs_space, act_space, {}),
            "random": (RandomPolicy, obs_space, act_space, {}),
        },
        # During training, alternate between the learning ppo_policy and
        # the frozen random one.
        "policy_mapping_fn": (
            lambda aid, **kwargs: ["ppo_policy", "random"][aid % 2]),
    },
    "evaluation_interval": 1,
    "evaluation_config": {
        "multiagent": {
            "policies": {
                "ppo_policy": (None, obs_space, act_space, {}),
                "random": (RandomPolicy, obs_space, act_space, {}),
            },
            # For evaluation, always use the RandomPolicy.
            "policy_mapping_fn": lambda agent_id, **kwargs: "random",
        },
    },
}

# Use PPOTrainer, but - depending on eval vs train - it will use either the
# main ppo_policy or the random one.
results = tune.run("PPO", config=config, stop=stop, verbose=1)

Not exactly sure this is what you'd like to achieve, but it shows nicely that different policies can be used inside any Trainer. Note that you can also set the list of policy IDs that RLlib should train via the multiagent.policies_to_train list, e.g. ["ppo_policy"]. It's not strictly necessary to set this, because RandomPolicy does nothing when its learn_on_batch method is called, but it would probably be cleaner to add it to the config anyway. I'll change our example and add a comment to explain.
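For example, a minimal sketch of that (just extending the config above):

config["multiagent"]["policies_to_train"] = ["ppo_policy"]  # "random" stays frozen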

@sven1977 Thanks for your quick response on this! Your suggestion is almost what I want. The RandomPolicy already helps me understand better how to work with a non-trainable policy. However, the difference is that I will not have a multi-agent setting. Instead, I want a single hard-coded policy that has only hyperparameters that can be tuned - the policy is non-learning. Something a human might do when coding some rules into an algorithm.
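Just to sketch the idea (purely hypothetical pseudocode, not my actual strategy; threshold is the kind of hyperparameter I would want to tune):

# Hypothetical hard-coded policy: the rule is fixed, only `threshold` is tunable.
def hard_coded_action(obs, threshold=0.5):
    if obs["price"][0] > threshold:
        return 1   # e.g. enter a position
    return 0       # e.g. do nothing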

To give an example, here is a custom environment I use for testing:

import gym
import numpy as np
from gym.spaces import Box, Dict, Discrete

class MyEnv(gym.Env):

    def __init__(self, config=None):
        config = config or {}

        self.price = config.get("price", 10)
        self.mu = config.get("mu", 0.4)
        self.sigma = config.get("sigma", 0.1)

        self.timestep_limit = config.get("ts", 100)

        observation_spaces = {
            "price":   Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float64),
            "position": Box(low=-self.timestep_limit, high=self.timestep_limit, shape=(1,), dtype=np.int32),
            "entry": Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float64)
        }
        self.observation_space = Dict(observation_spaces)
        self.action_space = Discrete(3)

        self.reset()

    def reset(self):

        self.price += np.random.normal(self.mu)*self.sigma
        self.reward = 0.0
        self.cumulated_reward = 0.0
        self.position = 0
        self.entry = 0.0

        self.timesteps = 0

        return self._get_obs()

    def _get_obs(self):

        return {
            "price": np.array([self.price], dtype=np.float64),
            "position": np.array([self.position], dtype=np.int32),
            "entry": np.array([self.entry], dtype=np.float64)
        }

    def step(self, action):

        action = -1 if action == 2 else action
        obs = self._get_obs()
        self.timesteps += 1
        is_done = self.timesteps >= self.timestep_limit
    
        self.position += action
        self.entry = np.absolute(action)*self.price
        self.price += np.random.normal(self.mu)*self.sigma

        self.reward = (self.price-self.entry)*self.position
        self.cumulated_reward += self.reward

        return obs, self.reward, is_done, {}
        
    def render(self, mode=None):
        print("Iteration: {}".format(self.timesteps))
        print("Cumulated reward: {}".format(round(self.cumulated_reward, ndigits=2)))
        print('Position: {}'.format(self.position))
        print()

And here is my custom policy:

import numpy as np

from ray.rllib.policy.policy import Policy
from ray.rllib.utils.annotations import override
from ray.rllib.utils.typing import ModelWeights

class DummyTrainer(Policy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
        self.multiplicator = self.config.get("multiplicator", 1)

    def compute_actions(self, 
                        obs=None,
                        state_batches=None,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        **kwargs):
        return np.random.choice(3)

    def learn_on_batch(self, samples):
        return

    @override(Policy)
    def get_weights(self) -> ModelWeights:
        """No weights to save."""
        return {}

    @override(Policy)
    def set_weights(self, weights: ModelWeights) -> None:
        """No weights to set."""
        pass

Note that multiplicator here is a hyperparameter which I would then like to tune (it makes no sense in this example). I then build a trainer and run it:

import ray
from ray.rllib.agents.trainer_template import build_trainer

ray.init(ignore_reinit_error=True)

MyTrainer = build_trainer(
    name="DummyTrainer",
    default_policy=DummyTrainer
)

config = {
    "env": MyEnv,
    "env_config": {
        "config": {
            "ts": 50,
            "mu": 20,
            "sigma": 0.05,
        },
    },
    "create_env_on_driver": True, 
}

my_trainer = MyTrainer(config=config)
results = my_trainer.train()

This gives me the following error:

RayTaskError(TypeError): ray::RolloutWorker.par_iter_next() (pid=55716, ip=192.168.1.111)
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py", line 1152, in par_iter_next
    return next(self.local_it)
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 332, in gen_rollouts
    yield self.sample()
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 706, in sample
    batches = [self.input_reader.next()]
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 96, in next
    batches = [self.get_data()]
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 223, in get_data
    item = next(self.rollout_provider)
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 634, in _env_runner
    _process_policy_eval_results(
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 1064, in _process_policy_eval_results
    actions: TensorStructType = eval_results[policy_id][0]
TypeError: 'int' object is not subscriptable

Somewhere an array is expected and an integer is provided, possibly my action? This is where I am stuck right now. I still do not understand the workflow in RLlib and what is happening there. I tried to read through the source code of sampler.py, but without debugging I cannot tell which of my values cause the exception above.

Once trainer.train() works, tune.run() should hopefully work as well, with DummyTrainer.multiplicator as a hyperparameter.

I am thankful for any help.

Hi @Lars_Simon_Zehnder,

The observations passed to compute_actions have a batch dimension. You are currently ignoring that batch dimension when you generate the random actions. It should be:

batch_size = obs.shape[0]
return np.random.choice(3, size=[batch_size, 1])

Also, you could use get_ and set_weights to save / restore the hyperparameters you are tuning.
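For example, a sketch of how that could look for your multiplicator (untested, just to illustrate the idea):

    @override(Policy)
    def get_weights(self) -> ModelWeights:
        # Expose the tunable hyperparameters as the policy "weights",
        # so they get saved/restored along with checkpoints.
        return {"multiplicator": self.multiplicator}

    @override(Policy)
    def set_weights(self, weights: ModelWeights) -> None:
        self.multiplicator = weights.get("multiplicator", self.multiplicator)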

Hi @mannyv, thanks for considering my question. But would that mean that the obs object can no longer be a dict and instead has to be array-like? Do I then also have to handle batched actions in the environment's step() function?

Reading through the tutorial of @sven1977, I thought one could use scalar actions when using train(). Does that only hold when using the MultiAgentPolicy?

Thanks for your help.

Hi @Lars_Simon_Zehnder,

Sorry, I had not considered that you were using a Dict space. You can still do it like this; you just need to pull a value out of the dictionary and compute the batch size from that value.

batch_size = obs["your_obs_key"].shape[0]
# You could make this more general by getting the 3 from self.action_space,
# but I do not know what kind of action space you are using, and how you get
# the size depends on its type.
return np.random.choice(3, size=[batch_size, 1])

Your environment does not need to worry about batch sizes. Since you are only using compute_actions, I think there is really only one parameter that will affect the batch size: num_envs_per_worker. That will determine your batch size.
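For example (a config sketch; the numbers are arbitrary):

config = {
    "env": MyEnv,
    "num_workers": 1,
    # With 4 sub-environments per worker, compute_actions will typically
    # be called with observation batches of size 4.
    "num_envs_per_worker": 4,
}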

If you search here for compute_actions, it lists the return type as follows. I think you will also need to return [] and {} for state_outs and info, but I am not certain about that. Something like this: return np.random.choice(3, size=[batch_size, 1]), [], {}

Returns
    actions (TensorType): Batch of output actions, with shape like
        [BATCH_SIZE, ACTION_SHAPE].

    state_outs (List[TensorType]): List of RNN state output
        batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

    info (List[dict]): Dictionary of extra feature batches, if any,
        with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

https://docs.ray.io/en/master/rllib-package-ref.html

Two implementation details that you do not need to worry about here, but since you seem interested, I will explain them.

  1. Behind the scenes, RLlib will automatically convert most environments into a vector environment. The vector size is determined by num_envs_per_worker. RLlib handles interacting with each environment individually, so the environments do not need to be concerned with this detail. Sometimes it is possible to handle your environment as a VectorEnv more efficiently than the default way RLlib does, in which case you can provide a custom implementation of the vector environment.

  2. RLlib will automatically convert most environments into a multi-agent env. If you do not have a multi-agent environment, it is turned into a multi-agent env with one agent. This again happens behind the scenes, and you do not need to worry about it in your implementation. RLlib handles it so that, from the perspective of the environment and the policy, it still looks like a single-agent environment.

Thanks a lot @mannyv for your elaborate answer. It took me some time to test and debug example code that implements your suggestions. However, I always get a weird error that I cannot analyze even when debugging.

My environment looks now like this:

class MyEnv(gym.Env):

    def __init__(self, config=None):
        config = config or {}

        self.price = config.get("price", 10)
        self.mu = config.get("mu", 0.4)
        self.sigma = config.get("sigma", 0.1)

        self.timestep_limit = config.get("ts", 100)

        observation_spaces = {
            "price":   Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float64),
            "position": Box(low=-self.timestep_limit, high=self.timestep_limit, shape=(1,), dtype=np.int32),
            "entry": Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float64)
        }
        self.observation_space = Dict(observation_spaces)
        self.action_space = Discrete(3)

        self.reset()

    def reset(self):

        self.price += np.random.normal(self.mu)*self.sigma
        self.reward = 0.0
        self.cumulated_reward = 0.0
        self.position = 0
        self.entry = 0.0

        self.timesteps = 0

        return self._get_obs()

    def _get_obs(self):

        return {
            "price": np.array([self.price], dtype=np.float64),
            "position": np.array([self.position], dtype=np.int32),
            "entry": np.array([self.entry], dtype=np.float64)
        }

    def step(self, action):

        action, _, _ = action
        action = -1 if action == 2 else action        
        self.timesteps += 1
        is_done = self.timesteps >= self.timestep_limit
    
        self.position += action
        self.entry = np.absolute(action)*self.price
        self.price += np.random.normal(self.mu)*self.sigma

        self.reward = (self.price-self.entry)*self.position
        self.cumulated_reward += self.reward
        obs = self._get_obs()

        return obs, self.reward, is_done, {}
        
    def render(self, mode=None):
        print("Iteration: {}".format(self.timesteps))
        print("Cumulated reward: {}".format(self.cumulated_reward))
        print('Position: {}'.format(self.position))
        print()

and my policy as follows:

from ray.rllib.policy.policy import Policy
from ray.rllib.utils.annotations import override
from ray.rllib.utils.typing import ModelWeights

class DummyTrainer(Policy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
        self.multiplicator = self.config.get("multiplicator", 1)

    def compute_actions(self, 
                        obs=None,
                        state_batches=None,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        **kwargs):
        batch_size = obs['price'].shape[0]                  
        return np.random.choice(3, size=[batch_size,1]), [], {}

    def learn_on_batch(self, samples):
        return

    @override(Policy)
    def get_weights(self) -> ModelWeights:
        """No weights to save."""
        return {}

    @override(Policy)
    def set_weights(self, weights: ModelWeights) -> None:
        """No weights to set."""
        pass

When I build my trainer and call train() on it as follows

from ray.rllib.agents.trainer_template import build_trainer

ray.init(ignore_reinit_error=True)

from ray.rllib.agents.ppo import PPOTrainer

MyTrainer = build_trainer(
    name="DummyTrainer",
    default_policy=DummyTrainer
)

config = {
    "env": MyEnv,
    "env_config": {
        "config": {
            "ts": 50,
            "mu": 20,
            "sigma": 0.05,
        },
    },
    "num_workers": 1,
    "log_level": "DEBUG",
    "create_env_on_driver": True,  
}

my_trainer = MyTrainer(config=config)

I get the following output from it:

2021-07-03 14:07:49,084	INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8266
2021-07-03 14:07:50,569	INFO trainer.py:669 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=126088) WARNING:tensorflow:From /home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=126088) Instructions for updating:
(pid=126088) non-resource variables are not supported in the long term
(pid=126088) WARNING:tensorflow:From /home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=126088) Instructions for updating:
(pid=126088) non-resource variables are not supported in the long term
2021-07-03 14:07:52,595	DEBUG rollout_worker.py:1122 -- Creating policy for default_policy
2021-07-03 14:07:52,596	DEBUG preprocessors.py:249 -- Creating sub-preprocessor for Box(0.0, inf, (1,), float64)
2021-07-03 14:07:52,596	DEBUG preprocessors.py:249 -- Creating sub-preprocessor for Box(-100, 100, (1,), int32)
2021-07-03 14:07:52,597	DEBUG preprocessors.py:249 -- Creating sub-preprocessor for Box(0.0, inf, (1,), float64)
2021-07-03 14:07:52,597	DEBUG catalog.py:631 -- Created preprocessor <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7fd85c00d370>: Dict(entry:Box(0.0, inf, (1,), float64), position:Box(-100, 100, (1,), int32), price:Box(0.0, inf, (1,), float64)) -> (3,)
2021-07-03 14:07:52,598	INFO rollout_worker.py:1161 -- Built policy map: {'default_policy': <__main__.DummyTrainer object at 0x7fd85c00d850>}
2021-07-03 14:07:52,599	INFO rollout_worker.py:1162 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7fd85c00d370>}
2021-07-03 14:07:52,600	DEBUG rollout_worker.py:531 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)
2021-07-03 14:07:52,602	INFO rollout_worker.py:563 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x7fd85c00d340>}
2021-07-03 14:07:52,603	DEBUG rollout_worker.py:678 -- Created rollout worker with env <ray.rllib.env.base_env._VectorEnvToBaseEnv object at 0x7fd85c00dd00> (<MyEnv instance>), policies {'default_policy': <__main__.DummyTrainer object at 0x7fd85c00d850>}
2021-07-03 14:07:52,608	WARNING util.py:53 -- Install gputil for GPU system monitoring.
<ray.rllib.agents.trainer_template.DummyTrainer at 0x7fd85c086100>

and from my_trainer.train():

2021-07-03 14:07:58,877	INFO trainer.py:569 -- Worker crashed during call to train(). To attempt to continue training without the failed worker, set `'ignore_worker_failures': True`.
---------------------------------------------------------------------------
RayTaskError(IndexError)                  Traceback (most recent call last)
<ipython-input-10-5523c1af32b5> in <module>
----> 1 results = my_trainer.train()

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py in train(self)
    571                         "continue training without the failed worker, set "
    572                         "`'ignore_worker_failures': True`.")
--> 573                     raise e
    574             except Exception as e:
    575                 time.sleep(0.5)  # allow logs messages to propagate

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py in train(self)
    560         for _ in range(1 + MAX_WORKER_FAILURE_RETRIES):
    561             try:
--> 562                 result = Trainable.train(self)
    563             except RayError as e:
    564                 if self.config["ignore_worker_failures"]:

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/tune/trainable.py in train(self)
    230         """
    231         start = time.time()
--> 232         result = self.step()
    233         assert isinstance(result, dict), "step() needs to return a dict."
    234 

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py in step(self)
    160         @override(Trainer)
    161         def step(self):
--> 162             res = next(self.train_exec_impl)
    163 
    164             # self._iteration gets incremented after this function returns,

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in __next__(self)
    754     def __next__(self):
    755         self._build_once()
--> 756         return next(self.built_iterator)
    757 
    758     def __str__(self):

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in apply_foreach(it)
    781 
    782             def apply_foreach(it):
--> 783                 for item in it:
    784                     if isinstance(item, _NextValueNotReady):
    785                         yield item

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in apply_filter(it)
    841     def filter(self, fn: Callable[[T], bool]) -> "LocalIterator[T]":
    842         def apply_filter(it):
--> 843             for item in it:
    844                 with self._metrics_context():
    845                     if isinstance(item, _NextValueNotReady) or fn(item):

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in apply_filter(it)
    841     def filter(self, fn: Callable[[T], bool]) -> "LocalIterator[T]":
    842         def apply_filter(it):
--> 843             for item in it:
    844                 with self._metrics_context():
    845                     if isinstance(item, _NextValueNotReady) or fn(item):

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in apply_foreach(it)
    781 
    782             def apply_foreach(it):
--> 783                 for item in it:
    784                     if isinstance(item, _NextValueNotReady):
    785                         yield item

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in apply_flatten(it)
    874     def flatten(self) -> "LocalIterator[T[0]]":
    875         def apply_flatten(it):
--> 876             for item in it:
    877                 if isinstance(item, _NextValueNotReady):
    878                     yield item

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in add_wait_hooks(it)
    826                             fn._on_fetch_start()
    827                         new_item = False
--> 828                     item = next(it)
    829                     if not isinstance(item, _NextValueNotReady):
    830                         new_item = True

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in apply_foreach(it)
    781 
    782             def apply_foreach(it):
--> 783                 for item in it:
    784                     if isinstance(item, _NextValueNotReady):
    785                         yield item

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in apply_foreach(it)
    781 
    782             def apply_foreach(it):
--> 783                 for item in it:
    784                     if isinstance(item, _NextValueNotReady):
    785                         yield item

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in apply_foreach(it)
    781 
    782             def apply_foreach(it):
--> 783                 for item in it:
    784                     if isinstance(item, _NextValueNotReady):
    785                         yield item

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py in base_iterator(timeout)
    469             while active:
    470                 try:
--> 471                     yield ray.get(futures, timeout=timeout)
    472                     futures = [a.par_iter_next.remote() for a in active]
    473                     # Always yield after each round of gets with timeout.

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
     45         if client_mode_should_convert():
     46             return getattr(ray, func.__name__)(*args, **kwargs)
---> 47         return func(*args, **kwargs)
     48 
     49     return wrapper

~/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/worker.py in get(object_refs, timeout)
   1479                     worker.core_worker.dump_object_store_memory_usage()
   1480                 if isinstance(value, RayTaskError):
-> 1481                     raise value.as_instanceof_cause()
   1482                 else:
   1483                     raise value

RayTaskError(IndexError): ray::RolloutWorker.par_iter_next() (pid=126088, ip=192.168.1.111)
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/util/iter.py", line 1152, in par_iter_next
    return next(self.local_it)
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 332, in gen_rollouts
    yield self.sample()
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 706, in sample
    batches = [self.input_reader.next()]
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 96, in next
    batches = [self.get_data()]
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 223, in get_data
    item = next(self.rollout_provider)
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 622, in _env_runner
    eval_results = _do_policy_eval(
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 1003, in _do_policy_eval
    policy.compute_actions_from_input_dict(
  File "/home/simon/git-projects/forex-strategy-learning/.venv/lib/python3.8/site-packages/ray/rllib/policy/policy.py", line 280, in compute_actions_from_input_dict
    return self.compute_actions(
  File "<ipython-input-3-32722a609754>", line 17, in compute_actions
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

When I debug my code, I always get the obs object when compute_actions is called, and I can also extract obs['price'] with shape[0] equal to one. Therefore I do not understand the error, which somehow points towards a non-integer index in numpy. Does anyone have an idea? @mannyv, @kai, or @sven1977 maybe?

Thanks in advance to everyone who reads and thinks :wink:

@Lars_Simon_Zehnder

Here are the changes that I made to get it to work for me. Anywhere you see a comment with "FIX", I made a change:

class DummyTrainer(Policy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.multiplicator = self.config.get("multiplicator", 1)

    def compute_actions(self,
                        obs=None,
                        state_batches=None,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        **kwargs):
        batch_size = obs.shape[0]
        if isinstance(self.action_space, Discrete):
            action_size = self.action_space.n
        else:
            raise ValueError(f"Ony Discrete spaces supported now. You need to implement a way to action_size for action_spape type {type(action_space)}")
        #FIX:return np.random.choice(action_size, size=[batch_size,1]) , [], {}
        return np.random.choice(action_size, size=[batch_size]) , [], {}

    def learn_on_batch(self, samples):
        return

    @override(Policy)
    def get_weights(self) -> ModelWeights:
        """No weights to save."""
        return {}

    @override(Policy)
    def set_weights(self, weights: ModelWeights) -> None:
        """No weights to set."""
        pass

class MyEnv(gym.Env):

    def __init__(self, config=None):
        config = config or {}

        self.price = config.get("price", 10)
        self.mu = config.get("mu", 0.4)
        self.sigma = config.get("sigma", 0.1)

        self.timestep_limit = config.get("ts", 100)

        observation_spaces = {
            "price":   Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float64),
            "position": Box(low=-self.timestep_limit, high=self.timestep_limit, shape=(1,), dtype=np.int32),
            "entry": Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float64)
        }
        self.observation_space = Dict(observation_spaces)
        self.action_space = Discrete(3)

        self.reset()

    def reset(self):

        self.price += np.random.normal(self.mu)*self.sigma
        self.reward = 0.0
        self.cumulated_reward = 0.0
        self.position = 0
        self.entry = 0.0

        self.timesteps = 0

        return self._get_obs()

    def _get_obs(self):

        return {
            "price": np.array([self.price], dtype=np.float64),
            "position": np.array([self.position], dtype=np.int32),
            "entry": np.array([self.entry], dtype=np.float64)
        }

    def step(self, action):

        #FIX:action, _, _ = action <- action is an int not an iterable
        action = -1 if action == 2 else action
        self.timesteps += 1
        is_done = self.timesteps >= self.timestep_limit

        self.position += action
        self.entry = np.absolute(action)*self.price
        self.price += np.random.normal(self.mu)*self.sigma

        #FIX:self.reward = (self.price-self.entry)*self.position <- this was sometimes an ndarray and needs to be an int
        self.reward = int((self.price-self.entry)*self.position)
        self.cumulated_reward += self.reward
        obs = self._get_obs()

        return obs, self.reward, is_done, {}

    def render(self, mode=None):
        print("Iteration: {}".format(self.timesteps))
        print("Cumulated reward: {}".format(self.cumulated_reward))
        print('Position: {}'.format(self.position))
        print()

Thank you for the extra effort you put into this, @mannyv. It helped me a lot. It now runs on my side as well, and I also used it in a tune.run() and got it running there, too. I consider this the solution to my problem.

For using tune.run() I had to change my DummyTrainer class slightly, so that the parameters to be tuned can be passed via the config dictionary (under the model's custom_model_config). I post my code here for other folks who might have a similar question:

from ray.rllib.policy.policy import Policy
from ray.rllib.utils.annotations import override
from ray.rllib.utils.typing import ModelWeights

class DummyTrainer(Policy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Read the tunable hyperparameters from the model's custom_model_config.
        model_config = self.config.get("model", {}).get("custom_model_config", {})
        self.multiplicator = model_config.get("multiplicator", 1)

    def compute_actions(self, 
                        obs=None,
                        state_batches=None,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        **kwargs):
        batch_size = obs.shape[0] 
        if isinstance(self.action_space, Discrete):
            action_size = self.action_space.n 
        else:
            raise ValueError(f"Only Discrete spaces supported now. You need \
                to implement a way to action_size for action_space {type(action_space)}" )

        return np.random.choice(action_size, size=[batch_size]), [], {}

    def learn_on_batch(self, samples):
        return

    @override(Policy)
    def get_weights(self) -> ModelWeights:
        """No weights to save."""
        return {}

    @override(Policy)
    def set_weights(self, weights: ModelWeights) -> None:
        """No weights to set."""
        pass

Then I use the following configs for tune:

config = {
    "env": MyEnv,
    "env_config": {        
        "ts": 50,
        "mu": 20,
        "sigma": 0.05,        
    },
    "num_workers": 3,
    "create_env_on_driver": True, 
    "model": {
        "custom_model_config": {
            "multiplicator": tune.grid_search([1, 2, 3, 4]),                
        },
    }, 
}

stop_config = {
    #"episode_reward_mean": 200,
    "training_iteration": 10,
}

where multiplicator is a dummy hyperparameter that can be tuned. I run this via:

tune_analysis = tune.run(MyTrainer, config=config, stop=stop_config, time_budget_s=20)
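In case it helps others: the best value found for multiplicator can then be read back from the returned ExperimentAnalysis object, e.g. (assuming episode_reward_mean is the metric of interest):

best_config = tune_analysis.get_best_config(metric="episode_reward_mean", mode="max")
print(best_config["model"]["custom_model_config"]["multiplicator"])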