Arbitrary action/observation space

Using FlexDict and Repeated spaces can provide great flexibility, but sometimes it’s still not enough.

I would assume that any object could be used for observations (and actions) as long as the Ray backend can handle it. The space would still require a sample() method, which might not be trivial to implement (but that should be the user’s responsibility anyway).

So are all these constraints really required, or could there be a simpler path that passes observations from the envs to the model without preprocessing, flattening, validating, etc.?

E.g.

import gym
import numpy as np

from ray.rllib.models.preprocessors import Preprocessor


class NoPreproc(Preprocessor):
    def _init_shape(self, obs_space: gym.Space, options: dict):
        return obs_space.shape

    def transform(self, observation):
        return observation

    # Is this necessary?
    # def write(self, observation, array,
    #           offset: int) -> None:
    #     array[offset:offset + self._size] = np.array(observation,
    #                                                  copy=False).ravel()

    @property
    def observation_space(self) -> gym.Space:
        return self._obs_space


class CustomSpace(gym.Space):
    def __init__(self):
        super().__init__()
        self._shape = 1 # required for preproc?
        self.max_len = 10

    def sample(self):
        size = np.random.randint(1, self.max_len)
        return np.random.rand(size)

    # OR
    # def sample(self):
    #     return custom_object

    def contains(self, x):
        return True

but this raises the following:

(pid=1661960)   File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
(pid=1661960)   File "/home/user/miniconda3/envs/mahenv/lib/python3.8/site-packages/ray/_private/function_manager.py", line 566, in actor_method_executor
(pid=1661960)     return method(__ray_actor, *args, **kwargs)
(pid=1661960)   File "/home/user/miniconda3/envs/mahenv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 516, in __init__
(pid=1661960)     self.policy_map, self.preprocessors = self._build_policy_map(
(pid=1661960)   File "/home/user/miniconda3/envs/mahenv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1127, in _build_policy_map
(pid=1661960)     preprocessor = ModelCatalog.get_preprocessor_for_space(
(pid=1661960)   File "/home/user/miniconda3/envs/mahenv/lib/python3.8/site-packages/ray/rllib/models/catalog.py", line 638, in get_preprocessor_for_space
(pid=1661960)     prep = _global_registry.get(RLLIB_PREPROCESSOR, preprocessor)(
(pid=1661960)   File "/home/user/miniconda3/envs/mahenv/lib/python3.8/site-packages/ray/tune/registry.py", line 135, in get
(pid=1661960)     value = _internal_kv_get(_make_key(category, key))
(pid=1661960)   File "/home/user/miniconda3/envs/mahenv/lib/python3.8/site-packages/ray/tune/registry.py", line 105, in _make_key
(pid=1661960)     key.encode("ascii"))
(pid=1661960) AttributeError: type object 'NoPreproc' has no attribute 'encode'

Actually that error was because I didn’t register the preprocessor, so ignore that.
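For reference, registering the preprocessor would look roughly like this (the name "no_preproc" is arbitrary; this sketch assumes the ModelCatalog API of Ray 1.x):

```python
# Hedged sketch: register the NoPreproc class from above so RLlib can
# look it up by name instead of by the class object itself.
from ray.rllib.models import ModelCatalog

ModelCatalog.register_custom_preprocessor("no_preproc", NoPreproc)

config = {
    "model": {
        "custom_preprocessor": "no_preproc",
    },
}
```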
However, RLlib still expects NumPy arrays, e.g. in rllib/policy/policy.py:

ret[view_col] = np.zeros_like([
    view_req.space.sample() for _ in range(batch_size)
])

which only makes sense when sample() returns a NumPy array of fixed shape.
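A minimal repro of the problem, using plain NumPy (no RLlib): stacking variable-length samples, as CustomSpace.sample() produces, is not a valid ndarray.

```python
import numpy as np

# Variable-length samples, like CustomSpace.sample() returns.
samples = [np.random.rand(3), np.random.rand(5)]
try:
    batch = np.zeros_like(samples)
except ValueError as exc:
    # Recent NumPy refuses to build a ragged array outright;
    # older versions fall back to an object array with a warning.
    print("zeros_like failed:", exc)

# With fixed-shape samples the same call works as policy.py expects.
fixed = [np.random.rand(4) for _ in range(2)]
print(np.zeros_like(fixed).shape)  # (2, 4)
```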

Would it be possible to move the arbitrary action space to an action mask? That way, the action space can be constant, which is a fundamental assumption for RL algorithms.

To be precise, make a wrapper Env whose action space has size self.max_len and, when stepping through the environment, apply a mask that zeroes out the invalid indices of the action space.
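In plain NumPy, the masking part of that wrapper could be sketched like this (mask_action and MAX_LEN are illustrative names; MAX_LEN plays the role of self.max_len):

```python
import numpy as np

MAX_LEN = 10  # illustrative; corresponds to CustomSpace.max_len above


def mask_action(action: np.ndarray, valid_len: int) -> np.ndarray:
    """Zero out entries beyond the currently valid prefix of the action."""
    mask = np.zeros(MAX_LEN)
    mask[:valid_len] = 1.0
    return action * mask


# The policy always emits a fixed-size action; the wrapper masks it.
action = np.ones(MAX_LEN)
print(mask_action(action, 4))  # first 4 entries kept, rest zeroed
```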

Thanks for the suggestion. I think the Repeated space naturally provides similar functionality, which does help a bit, but in general things would be easier if there were no need to encode everything as a NumPy array.