The *ambition* of this page is to discuss (and hopefully dispel) the myth: Non-Markovianness invalidates most (basic) RL methods.

This myth says that all the basic RL algorithms (Q-learning, TD(lambda), Sarsa(lambda), Bellman-gradient methods, etc.) depend on the Markovian assumption, i.e., that observations constitute the full state, and hence are simply inapplicable to non-Markovian RL problems. And of course, most problems of interest are non-Markovian, so (the myth concludes) the basic RL algorithms are not very applicable.

There are several reasons for this myth:

- Most (though not all) of the theoretical *convergence* results for these basic RL algorithms assume Markovianness. The reason for this is obvious: the math becomes easy.
- There are very simple non-Markovian problems on which specific RL algorithms, such as Q-learning with lookup tables and without eligibility traces, converge to bad policies (a toy example follows this list).
- Some folks have run into trouble applying naive implementations (say, lookup-table Q-learning) to non-Markovian problems, and other folks have heard this trouble described.
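
To make the second reason concrete, here is a minimal sketch (illustrative, not from the original page) of tabular Q-learning on a toy problem with hidden state: two underlying states alternate deterministically but emit the same observation, so the lookup table cannot distinguish them, and the learner settles on a policy far worse than what even one bit of memory would allow.

```python
import random
from collections import defaultdict

# Hypothetical toy problem: two hidden states alternate deterministically,
# s0 -> s1 -> s0 -> ..., but both emit the same observation, so a lookup
# table cannot tell them apart.
def step(state, action):
    reward = 1.0 if action == state else -1.0  # the right action depends on the hidden state
    return reward, 1 - state                   # hidden states alternate

def observe(state):
    return 0  # both hidden states look identical to the agent

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)  # keyed by (observation, action)

state = 0
for _ in range(50_000):
    obs = observe(state)
    if random.random() < epsilon:              # epsilon-greedy exploration
        action = random.randrange(2)
    else:
        action = max((0, 1), key=lambda a: Q[(obs, a)])
    reward, next_state = step(state, action)
    next_obs = observe(next_state)
    best_next = max(Q[(next_obs, a)] for a in (0, 1))
    Q[(obs, action)] += alpha * (reward + gamma * best_next - Q[(obs, action)])
    state = next_state

# Both actions end up looking roughly equally (un)attractive, and the greedy
# memoryless policy earns about 0 per step, while a policy with one bit of
# memory that alternates actions would earn +1 per step.
print({k: round(v, 2) for k, v in Q.items()})
```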

The reasons that this is a myth are also many:

- The fact that the *currently available* positive convergence results for the basic RL algorithms assume Markovianness does not imply anything about how these algorithms deal with non-Markovianness.
- The negative examples of problems where the basic RL algorithms have trouble with non-Markovianness are just that: examples that are troubling for the *simplest* instantiations of the basic RL algorithms. The point here is that one would never use these forms of the basic RL algorithms in any real application of interest.
- As noted above, most (if not all) of the RL problems of interest are non-Markovian, and hence the many RL success stories on this page attest to the ability of RL algorithms to deal quite successfully with non-Markovianness. This is perhaps the strongest argument that this is a myth.

Now let's consider a few of the ways in which RL algorithms successfully deal with hidden state.

- Eligibility traces: there is empirical evidence that eligibility traces (which make the updates more Monte Carlo-like) enable basic RL algorithms like TD and Q-learning to deal well with hidden state (see the first sketch following this list).
- Feature construction: when faced with a real problem, the researcher/practitioner designs state features, possibly including features derived from a memory of past observations, to build good estimates of state (see the second sketch following this list).
- Policy search methods: RL methods based on an explicit representation of a space of policies, together with an algorithm for searching that space, can easily be immune to the problem of hidden state (see the third sketch following this list).
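
Taking the eligibility-trace point first, below is a minimal sketch of tabular Sarsa(lambda) with replacing traces. The environment hooks `env_reset` and `env_step`, along with all the parameter values, are hypothetical placeholders rather than anything this page prescribes.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, obs, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(obs, a)])

def sarsa_lambda_episode(Q, env_reset, env_step, actions,
                         lam=0.9, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode of tabular Sarsa(lambda) with replacing traces.
    Q: dict-like map (obs, action) -> value, e.g. defaultdict(float).
    env_reset() -> obs; env_step(obs, action) -> (reward, next_obs, done);
    both are hypothetical placeholder hooks."""
    E = defaultdict(float)                      # eligibility traces
    obs = env_reset()
    action = epsilon_greedy(Q, obs, actions, epsilon)
    done = False
    while not done:
        reward, next_obs, done = env_step(obs, action)
        next_action = epsilon_greedy(Q, next_obs, actions, epsilon)
        target = reward if done else reward + gamma * Q[(next_obs, next_action)]
        delta = target - Q[(obs, action)]
        E[(obs, action)] = 1.0                  # replacing trace
        # Credit flows back along the visited trajectory, so updates lean on
        # actual experienced rewards rather than only on bootstrapped values
        # of (possibly aliased) observations.
        for key in list(E):
            Q[key] += alpha * delta * E[key]
            E[key] *= gamma * lam
        obs, action = next_obs, next_action
```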
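For feature construction, one of the simplest concrete recipes is a sliding window over recent observations. The class below is an illustrative sketch; the window length `k` and the padding value are assumptions, not anything the page specifies.

```python
from collections import deque

class HistoryFeatures:
    """Turn a stream of raw observations into features that include a short
    memory of the recent past: a simple hand-built approximation of state."""

    def __init__(self, k=4, pad=0.0):
        self.window = deque([pad] * k, maxlen=k)

    def update(self, obs):
        """Fold in the newest observation and return the feature vector:
        the last k observations, oldest first."""
        self.window.append(obs)
        return tuple(self.window)

# Usage sketch: hand the feature tuple, not the raw (aliased) observation,
# to whatever learner you like, tabular or function approximation.
feats = HistoryFeatures(k=3)
for obs in [0, 1, 0, 0, 1]:
    phi = feats.update(obs)   # e.g. (1, 0, 0) after the fourth step
```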
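For policy search, here is a minimal sketch in the REINFORCE style with a two-action logistic policy read directly off the observation. Because the policy is scored purely by the returns it earns, nothing in the update assumes the observations are Markov, and the learned policy can be stochastic, which hidden state sometimes demands. The environment hooks are hypothetical placeholders as before.

```python
import math
import random

def reinforce_episode(theta, env_reset, env_step, alpha=0.01, gamma=0.99):
    """One REINFORCE update computed from a single episode.
    theta: dict-like map obs -> logit, e.g. collections.defaultdict(float).
    env_reset() -> obs; env_step(obs, action) -> (reward, next_obs, done);
    both are hypothetical placeholder hooks."""
    trajectory = []                                # (obs, action, reward)
    obs, done = env_reset(), False
    while not done:
        p1 = 1.0 / (1.0 + math.exp(-theta[obs]))   # P(action=1 | obs)
        action = 1 if random.random() < p1 else 0  # sample from the policy
        reward, next_obs, done = env_step(obs, action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    # Walk backwards accumulating returns, then ascend the score-function
    # gradient: d/dtheta log pi(action|obs) = action - p1 for this policy.
    G = 0.0
    for obs, action, reward in reversed(trajectory):
        G = reward + gamma * G
        p1 = 1.0 / (1.0 + math.exp(-theta[obs]))
        theta[obs] += alpha * G * (action - p1)
    return theta

# Usage sketch: theta = collections.defaultdict(float), then call
# reinforce_episode(theta, env_reset, env_step) repeatedly with your hooks.
```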
