I wonder which of the algorithms at https://docs.ray.io/en/latest/rllib/rllib-algorithms.html can be used to solve probabilistic environments like the rock-paper-scissors game or a contextual bandit? Is there a special configuration parameter to use a probabilistic policy?
What is the best way to define the observation if the observation is always the same?
Hey @Peter_Pirog,
Have a look at the [contextual bandits section](https://docs.ray.io/en/latest/rllib/rllib-algorithms.html#contextual-bandits). Linear Upper Confidence Bound (LinUCB) and Thompson Sampling (LinTS) are both algorithms for solving such environments.
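A minimal sketch of how to configure them (assuming Ray 2.x, where the bandit algorithms live under `ray.rllib.algorithms.bandit`; in Ray 1.x they were under `ray.rllib.contrib.bandits` instead, and `"MyBanditEnv"` here is just a placeholder for your own registered environment):

```python
from ray.rllib.algorithms.bandit import BanditLinUCBConfig
# For Thompson Sampling, BanditLinTSConfig is configured the same way.

config = (
    BanditLinUCBConfig()
    # "MyBanditEnv" is a placeholder -- register or pass your own env here.
    .environment(env="MyBanditEnv")
    # RLlib's bandit implementations are torch-only.
    .framework("torch")
)
algo = config.build()
for _ in range(5):
    print(algo.train()["episode_reward_mean"])
```

Regarding the probabilistic-policy question: Thompson Sampling explores by sampling from a posterior over each arm's expected reward, so its behavior is probabilistic by construction; no special configuration flag is needed.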
> What is the best way to define the observation if the observation is always the same?
What do you mean by that? Can you explain, please?
@arturn, maybe I don't understand correctly, but if I have 3 bandits and I can use any of them N times, then in each of the N iterations my observations are the same - all 3 bandits are available, for example:
iteration 1 - can use any bandit [1,1,1]
iteration 2 - can use any bandit [1,1,1]
iteration 3 - can use any bandit [1,1,1] etc.
The difference is in the reward, but the observation is always the same, where the position of a 1 in the list shows which bandit is available: [0,1,0] means I can use only bandit 1; bandits 0 and 2 are unavailable.
@Peter_Pirog You should be able to use the existing contextual bandit algorithms in RLlib for your problem. The contextual bandit setting is essentially a superset of the fixed-observation problem you mentioned. You should just create an environment that always returns a fixed observation. I hope this helps.
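A minimal sketch of such an environment for the 3-bandit case above (the class name and payout probabilities are invented for illustration, and it follows the classic `gym` API that RLlib's bandit examples were written against):

```python
import gym
import numpy as np
from gym.spaces import Box, Discrete


class FixedObsBanditEnv(gym.Env):
    """3-armed bandit whose observation is always [1, 1, 1]."""

    def __init__(self, config=None):
        self.action_space = Discrete(3)
        self.observation_space = Box(0.0, 1.0, shape=(3,), dtype=np.float32)
        # Hidden per-arm win probabilities (made up for this sketch).
        self._pay_probs = [0.2, 0.5, 0.8]

    def reset(self):
        # The context is constant: all three arms are always available.
        return np.ones(3, dtype=np.float32)

    def step(self, action):
        # Only the reward is stochastic; the observation never changes.
        reward = float(np.random.random() < self._pay_probs[action])
        # One pull per episode, as in RLlib's bandit example envs.
        return np.ones(3, dtype=np.float32), reward, True, {}
```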
@kouros, thank you for the answer:
> You should just create an environment that always returns a fixed observation.