A workaround is to use a custom preprocessor. But I think better options would be to (1) provide a config option to disable the shape check, or (2) collect the latest observation_space / action_space from the runtime environment.
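To make it concrete, here is roughly what I mean by the custom-preprocessor workaround. This is just a sketch against the Preprocessor / ModelCatalog API (which may differ between RLlib versions); PADDED_DIM, the class name, and the padding logic are made up for illustration:

```python
import numpy as np

from ray.rllib.models import ModelCatalog
from ray.rllib.models.preprocessors import Preprocessor

PADDED_DIM = 32  # made-up fixed size, large enough for every agent kind


class PadObsPreprocessor(Preprocessor):
    """Pad every agent's flat observation up to one fixed length,
    so the hard-coded shape check always passes."""

    def _init_shape(self, obs_space, options):
        # Report one common shape regardless of the agent's real space.
        return (PADDED_DIM,)

    def transform(self, observation):
        flat = np.asarray(observation, dtype=np.float32).ravel()
        padded = np.zeros(PADDED_DIM, dtype=np.float32)
        padded[:flat.size] = flat
        return padded


ModelCatalog.register_custom_preprocessor("pad_obs", PadObsPreprocessor)
# then in the trainer config: config["model"]["custom_preprocessor"] = "pad_obs"
```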
I remember that in a very old version of RLlib, say v0.6, this issue did not exist: I could change obs, reward, done, info however I liked, with the only constraint that the keys in those dicts had to be aligned with each other.
Hey @pengzh, which algorithm are you using? QMIX?
If you are using a non-MARL specific algo (using our multiagent API), your observation space should not depend on how many agents are present in the episode, but should be an individual agent’s space.
Hey Sven! I am just using independent PPO (with a shared policy name, so there is only one policy).
I did find that there is a place where the remote worker is created with (policy_cls, env.observation_space, env.action_space, config), and that is what causes my issue.
I think that, since the config['multiagent']['policies'] dict already specifies the obs space for each policy, RLlib should not query env.observation_space / env.action_space to create policies in the remote worker. This is problematic because the environment might have different “kinds” of agents, and those kinds might not share the same spaces.
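For example, this is roughly what I would expect to be able to write; the "scout" / "worker" agent kinds, policy names, and space shapes are made up just to show two kinds of agents with different spaces:

```python
import gym
import numpy as np

config = {
    "multiagent": {
        # Each entry is (policy_cls, obs_space, action_space, policy_config);
        # None for policy_cls means "use the trainer's default policy class".
        "policies": {
            "scout_policy": (None,
                             gym.spaces.Box(-1.0, 1.0, (8,), np.float32),
                             gym.spaces.Discrete(4),
                             {}),
            "worker_policy": (None,
                              gym.spaces.Box(-1.0, 1.0, (12,), np.float32),
                              gym.spaces.Discrete(6),
                              {}),
        },
        # Route each agent id to the policy matching its "kind".
        "policy_mapping_fn": lambda agent_id: (
            "scout_policy" if agent_id.startswith("scout") else "worker_policy"),
    },
}
```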
Hey @pengzh, I see your problem. But the thing is that a Policy can only handle one observation space, so if you have different agents using the same Policy but with different obs spaces, that will not work.
The main reason is that the Policy’s model(s) would have to have different first layer weight matrices to handle the different input formats.
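Just to illustrate with a toy example (plain numpy, shapes made up): a first-layer weight matrix built for one input size simply cannot consume observations of another size.

```python
import numpy as np

# First-layer weight matrix sized for an 8-dim observation space.
w = np.random.randn(8, 64)

obs_small = np.random.randn(8)    # agent kind with an 8-dim obs
obs_large = np.random.randn(12)   # agent kind with a 12-dim obs

print((obs_small @ w).shape)      # (64,) -> works

try:
    obs_large @ w                 # 12-dim input vs. 8-row weight matrix
except ValueError as err:
    print("shape mismatch:", err)
```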
If I understand it correctly, RLlib assumes that there is only one kind of observation/action space for all policies, namely exactly env.observation_space and env.action_space. So this hard-coded assumption makes it hard for users to work with multiple spaces.
Sorry, I’m still not sure I understand 100%. The code you mention above is only used in the case where you do not provide a policies dict (RLlib then has to create one automatically from the env’s spaces).
But even if you specify a policies dict in your multiagent config, it’s not possible for different agents that all map to the same(!) policy (as per your policy_mapping_fn) to have different observation spaces.