Hi @mannyv,
thanks for your help again. I have also made my question above more precise. The problem in my episode data is the actions array holding the actions of a single episode: its first entry has a different structure than the others, and exactly this first entry breaks the postprocessing of an episode. My question now is: what produces this first action, and what do I have to change so that the first action is an array shaped like the others?
> The first place I would look, and maybe you already have, is in the reset function of your environment. This is where the first observation will come from. Is it somehow returning something different for the observation than step is?
In the `reset()` method of my environment I actually do not generate actions; is that something one should do? The `reset()` function in my environment simply returns the observation, whereas my `step()` function additionally returns `reward`, `done`, and `info`. I think that should work.
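To make this concrete, here is a minimal sketch of the interface I am describing, following the classic Gym API (the class name, spaces, and shapes are placeholders, not my actual code):

```python
import gym
import numpy as np
from gym import spaces

class MyTradingEnv(gym.Env):
    """Minimal sketch; observation/action spaces are placeholders."""

    def __init__(self):
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Box(
            low=0.0, high=1.0, shape=(2,), dtype=np.float32)

    def reset(self):
        # reset() returns only the first observation; no action is produced here
        return self.observation_space.sample()

    def step(self, action):
        # step() additionally returns reward, done, and info
        obs = self.observation_space.sample()
        reward = 0.0
        done = False
        info = {}
        return obs, reward, done, info
```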
> If I were you I would also be concerned with those nan's.
The `nan`s are there on purpose. Earlier I used `None` values as an indicator that an agent does nothing (only in the `stop` variable of the action; as this is a `float`, I could also use `0.0`), but that raised errors. Would you rather suggest using `0.0`?
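For illustration, a small sketch of the two sentinel choices (the `[stop, trade]` layout is my own naming here, and treating `0.0` as a neutral stop value is an assumption):

```python
import numpy as np

# Hypothetical action layout: [stop, trade]
noop_nan = np.array([np.nan, 0.0], dtype=np.float32)   # NaN as a "do nothing" marker
noop_zero = np.array([0.0, 0.0], dtype=np.float32)     # 0.0 as a neutral no-op

# NaNs poison any downstream arithmetic (normalization, advantages, losses):
print(noop_nan.sum())             # nan
print(np.isnan(noop_nan).any())   # True -- every consumer must special-case this

# A neutral 0.0 keeps the postprocessing numeric and well-defined:
print(noop_zero.sum())            # 0.0
```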
> Are you handling the combination of Discrete and Continuous actions in a special way? I do not remember seeing RLlib handle mixed action spaces, but in all honesty it could be there and I have not encountered it.
Good question @mannyv! I actually came up with this because I learned to keep types consistent, and having names and specific types makes the code more readable. `trade` is an indicator, so I used discrete values. I could of course also use a `float` for the second value (`trade`) and simply a `Box` space with `shape=(2,)`. Maybe this is the reason for this phenomenon.
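To show the two options side by side, here is a sketch using `gym.spaces` (the key names, bounds, and the three `trade` values are placeholders based on my description, not my exact code):

```python
import numpy as np
from gym import spaces

# Mixed action space roughly as I have it now:
mixed_space = spaces.Dict({
    "stop": spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),
    "trade": spaces.Discrete(3),  # e.g. hold / buy / sell
})

# Flattened alternative: everything continuous in one Box
flat_space = spaces.Box(low=np.array([0.0, 0.0]),
                        high=np.array([1.0, 2.0]),
                        dtype=np.float32)

print(mixed_space.sample())  # e.g. OrderedDict([('stop', array([0.42])), ('trade', 1)])
print(flat_space.sample())   # e.g. array([0.17, 1.63])
```

With the flat `Box`, the `trade` component would of course have to be rounded back to an integer inside the environment.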
Simon