This is a theoretical question; apologies if this is the wrong forum.
I am trying to model a trading algorithm with the following dynamics:
S = State of market
A = agent's actions
R = reward (profit/loss)
In this setup (let's call it a game), the agent receives a reward if its actions (trades) are profitable.
The transition probabilities are P(S(T) | S(T-1)), i.e. the agent’s actions have no effect on the next state of the market (there are too many market participants).
The profit or loss is not instant. It accrues over a period, so there is a credit assignment problem.
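To make the setup concrete, here is a minimal Python sketch of the dynamics I have in mind. Everything in it is an assumption for illustration only: the class name `ExogenousMarketEnv`, the random-walk price standing in for the market state, the three actions (-1 sell, 0 hold, +1 buy), and the fixed `hold_steps` delay before a trade's profit or loss is realized. It is not tied to any library's API.

```python
import random


class ExogenousMarketEnv:
    """Toy sketch: market state evolves independently of the agent's actions,
    and the reward for a trade arrives only after a delay."""

    def __init__(self, horizon=50, hold_steps=3, seed=0):
        self.horizon = horizon
        # Hypothetical parameter: steps between opening a trade and realizing its P&L.
        self.hold_steps = hold_steps
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.t = 0
        self.price = 100.0   # stands in for the market state S
        self.pending = []    # open trades: (settle_time, direction, entry_price)
        return self.price

    def step(self, action):
        # action: -1 = sell, 0 = hold, +1 = buy (assumed action set)
        if action != 0:
            self.pending.append((self.t + self.hold_steps, action, self.price))

        # Transition uses only the previous state: P(S(T) | S(T-1)).
        # The action is never consulted here.
        self.price += self.rng.gauss(0.0, 1.0)
        self.t += 1

        # Delayed reward: settle only the trades whose holding period has elapsed.
        reward = 0.0
        still_open = []
        for settle_t, direction, entry in self.pending:
            if settle_t <= self.t:
                reward += direction * (self.price - entry)
            else:
                still_open.append((settle_t, direction, entry))
        self.pending = still_open

        done = self.t >= self.horizon
        return self.price, reward, done
```

With the same seed, two different action sequences see the same price path; only the rewards differ, and a reward shows up `hold_steps` steps after the trade that caused it.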
My questions are:
- Is this an MDP, a POMDP, or a semi-MDP?
- Do any of the RLlib algorithms work in such a scenario?