RL does not work well with function approximation


The ambition of this page is to discuss (and hopefully dispel) the myth that RL does not work well with function approximation.


There are several reasons for the existence of this myth.

  1. A fairly common beginning graduate student experience is as follows: I know that sigmoidal neural networks with backpropagation [replace with your favorite function approximation method] are supposed to work and that Q-learning is supposed to work and I put them together on this moderate sized problem and it didn't work!
  2. There are by now a few apocryphal horror stories about experiences with RL and function approximation that make the round at conferences and workshops that bolster this myth.
  3. There are oft-cited theoretical results that state that Q-learning/TD can diverge with linear function approximation. This seems really damning because we would really like to be able to use linear function approximation. And if linear function approximators cannot be trusted to work with RL, then things seem dim at best.

 

The reality, on the other hand, is that RL works very well with function approximation. All the many small and large successes of RL have involved the use of function approximators including linear function approximators, memory-based methods, CMACs, sigmoidal neural networks, SVMs, rules, etc. Scanning a list of RL applications it should be uncontroversial that RL works well with function approximation. Indeed, one of the main ways in which RL ideas contribute to the previous theory of dynamic programming (DP) is in the ease of use of function approximation. Lets consider this in more detail.

  1. RL is typically on-policy while DP is off-policy. It is known that function approximation under on-policy distributions is better behaved than under off-policy distributions (someone please add ref. here).
  2. The use of eligibility traces allow RL to take on the properties of Monte Carlo methods and these are much more robust to function approximation problems. The extreme case of pure Monte Carlo for policy evaluation never diverges with function approximation.

The fact that Bertsekas & Tsitsiklis's book on reinforcement learning is called "Neuro-Dynamic Programming" is not a coincidence. I would conjecture that the reason RL is of such great interest to many OR folks is because RL is much better behaved with function approximators (such as neural networks) than their favourite DP  methods.

But what about the negative results noted above (examples of divergence of specific RL algorithms with specific function approximators)? These results are important. However all they really imply is that it is possible to construct (usually quite artificial) problems in which a specific function approximator and specific RL algorithm fail. While this should temper irrational exhuberance, it should not overcome the many positive instances of RL working successfully with function approximation.

At the same time, developing more off-the-shelf RL methods than exist today would be helpful towards really squashing this myth.