RL is about learning optimal policies

Page history last edited by Satinder Singh 15 years, 4 months ago

The ambition of this page is to discuss (and hopefully dispel) the myth that RL is about learning optimal policies.


One often hears statements of the following sort: "RL is about learning optimal policies, and of course in any real-world AI problem it is silly even to imagine learning optimal policies, and hence..." (the usual implication being that RL is not the answer).

The main reason for this myth is that much of the theoretical exposition of RL algorithms such as Q-learning, TD, etc., is in terms of asymptotically achieving optimal value functions or policies. The same theory that accounts in part for the rising popularity of RL also condemns it to irrelevance in the eyes of some folks. Another reason is that in much of the (early) empirical work on RL, the results of learning are compared to the gold standard of optimal policies or values. Thus the word "optimal" is paired so often with "RL" that the two are firmly associated in many a mind.

The reasons this is a myth are:

  1. Several RL methods are not about optimal policies at all. Basically any method that uses function approximation, including value function approximation, approximate policy iteration, etc., automatically gives up on learning optimal policies in favour of simply learning "good" policies.
  2. Optimality, or optimization, is a framework for deriving reinforcement learning algorithms and not the goal in itself. This is again most clearly seen in, for example, policy gradient methods, in which the goal is local optimality rather than global optimality.
  3. Most real applications of RL are to problems large enough that it is simply not possible to compute optimal policies, so in those cases the researchers simply report the quality of the policy learned. For example, the work on helicopter control using RL by Andrew Ng and his group makes no claim about having found optimal policies, nor did the (now) ancient TD-Gammon work by Gerry Tesauro. The papers that do compare performance against optimal policies do so because they can (an optimal policy is surely a good choice of benchmark when it is available).
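
Point 1 can be made concrete with a minimal sketch. The toy problem below is hypothetical: its states, actions, and rewards are invented for illustration and do not come from any of the work mentioned above. It is a one-step decision problem with two equally likely states, so the Q-values are just immediate rewards. State aggregation, about the crudest form of function approximation, forces both states to share a single Q-value per action, and the greedy policy that results is better than random but not optimal.

```python
# Hypothetical toy problem (all names and numbers are assumptions):
# two equally likely states, two actions, one decision per episode.
REWARD = {
    ("s0", "a"): 1.0, ("s0", "b"): 0.0,
    ("s1", "a"): 0.6, ("s1", "b"): 0.8,
}
STATES = ["s0", "s1"]
ACTIONS = ["a", "b"]


def policy_value(policy):
    """Expected reward of a deterministic policy under a uniform start state."""
    return sum(REWARD[(s, policy[s])] for s in STATES) / len(STATES)


def greedy_tabular():
    """Exact (tabular) solution: choose the best action separately in each state."""
    return {s: max(ACTIONS, key=lambda a: REWARD[(s, a)]) for s in STATES}


def greedy_aggregated():
    """State aggregation: both states map to one feature, so there is only
    one Q-value per action (the average reward over the aggregated states)."""
    q = {a: sum(REWARD[(s, a)] for s in STATES) / len(STATES) for a in ACTIONS}
    best = max(ACTIONS, key=lambda a: q[a])
    return {s: best for s in STATES}


def uniform_random_value():
    """Expected reward when states and actions are both chosen uniformly."""
    return sum(REWARD.values()) / len(REWARD)


optimal = policy_value(greedy_tabular())
approx = policy_value(greedy_aggregated())
baseline = uniform_random_value()

print(f"optimal policy:    {optimal:.2f}")
print(f"aggregated policy: {approx:.2f}")
print(f"random policy:     {baseline:.2f}")
```

Under these made-up rewards the aggregated policy earns 0.8 per episode, against 0.9 for the optimal policy and 0.6 for acting uniformly at random. The approximation cannot represent the optimal policy at all, yet what it learns is still "good", which is the sense in which such methods give up on optimality.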

