• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!


Algorithms of Reinforcement Learning

Page history last edited by Satinder Singh 15 years, 2 months ago

The ambition of this page is to be a comprehensive collection of links to papers describing RL algorithms. In order to make this list manageable we should only consider RL algorithms that originated a class of algorithms and have been used/studied by at least one researcher(s) unaffiliated with the original inventor(s) of the algorithm.

Associative Reinforcement Learning Algorithms

  • (A_{R-P} )Barto, AG & Anandan, P (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-374.


  • (The foundational paper) Barto, A.G., Sutton, R.S., & Anderson, C. (1983). Neuron-like adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC-13: 834-846. (link)


Temporal Differences (TD)

  • (The foundational paper) Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44. (pdf)
  • (Introduced Replace Traces) Singh S. & Sutton R. (1996). Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning journal, Volume 22, Issue 1, pages 123-158. (link)
  • Page with links to Theoretical Analyses of TD(lambda)



  • (The foundational thesis) C. J. Watkins. Learning from Delayed Rewards. Phd thesis, Cambridge University, 1989.
  • (Link to an online paper with description of Q-learning) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)
  • Page with links to Theoretical Analyses of Q-learning


Dyna        (shows that learning and planning are related and integral to RL)

  • (Algorithm) R. S. Sutton (1991). Planning by incremental dynamic programming. Proceedings of the Eighth International Workshop on Machine Learning, pp. 353-357, Morgan Kaufmann (link)


Real-Time Dynamic Programming

  • (Algorithm) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)


SARSA     (onpolicy method for control)

  • (Algorithm) Rummery G. and Niranjan, M. (1994). On-line q-learning using connectionist systems,Tech. Rep. Technical Report CUED/F-INFENG/TR 166, Cambridge, University Engineering Department. (link)
  • Page with links to Theoretical Analyses of Sarsa



  • (A precursor) S. J., and Barto, A. G., Linear Least-Squares Algorithms for Temporal Difference Learning, Machine Learning, 22, 1996, pp. 3357.
  • (Algorithm) Boyan, J. A., Technical Update: Least-Squares Temporal Difference Learning, In Machine Learning, 49, 2002. (An ICML version is here).


LSPI     (Least Squares Policy Iteration)

  • (Algorithm)  Michail Lagoudakis and Ronald Parr, Least Squares Policy Iteration. Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 1107-1149. (pdf)



  • (Algorithm) Moore A. and Atkeson C. (1995). The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. In Machine Learning, Vol 21. (link)


Prioritized Sweeping

  • (Algorithm) Moore A. and Atkeson C. (1993) Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time, In Machine Learning, Vol 13, pages 103-130. (link)



  • (Main Algorithm) Kearns M. and Singh S. (2002). Near-Optimal Reinforcement Learning in Polynomial Time, In Machine Learning journal, Volume 49, Issue 2, pages 209-232, 2002. (pdf)
  • (For Factored MDPs) Kearns M. and Koller D. (1999). Efficient Reinforcement Learning in Factored MDPs.  Proceedings of the Sixteenth International Joint  Conference on Artificial Intelligence,  Morgan Kaufmann, pages 740--747 (pdf)
  • (In Metric Spaces) Kakade S., Kearns M. and Langford J. (2003).  Exploration in Metric State Spaces.  In ICML. (pdf)



  • (Algorithm) Kearns M., Mansour Y. and Ng A. (1999). A Sparse Sampling Algorithm for Near-Optimal  Planning in Large Markov Decision Processes. Proceedings of the Sixteenth International Joint  Conference on Artificial Intelligence  Morgan Kaufmann, pages 1324--1331.  Also in the journal  Machine Learning. (pdf)


Trajectory Tree

  •  (Algorithm) Kearns M., Mansour Y. and Ng A. (2000). Approximate planning in large POMDPs via reusable trajectories. In NIPS 12, 2000. (pdf)


Policy Search Algorithms

  • (PEGASUS) Ng A. and Jordan M. (2000). PEGASUS: A policy search method for large MDPs and POMDPs, In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference. (pdf)


Policy Gradient Algorithms

  • (Reinforce) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256. (link)
  • (Algorithm) Sutton R., McAllester D., Singh S. and Mansour Y (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12 (NIPS), 2000 (link).
  • J. Baxter and P. L. Bartlett.  Infinite-horizon gradient-based policy search.  Journal of Artificial Intelligence Research, 15:319-350, 2001.
  • J. Baxter and P. L. Bartlett.  GPOMDP: An on-line algorithm for estimating performance gradients  in POMDP's, with applications.  In Proceedings of the 2000 International Conference on Machine  Learning, pages 41-48, 2000.
  • (*Natural Policy Gradient Algorithms*) S. Kakade, J. Bagnell, and J. Peters.


Linear Programming based RL methods

  • D. P. de Farias and B. Van Roy, A Linear Program for Bellman Error Minimization with Performance Guarantees, In Mathematics of Operations Research. (pdf)


Algorithms for Structured MDPs/POMDPs

  • C. Guestrin, M. Hauskrecht and B. Kveton. Solving Factored MDPs with Continuous and Discrete Variables. In the Twentieth Conference on  Uncertainty in Artificial Intelligence, Banff, Canada, July 2004. (link)
  •  C. Guestrin, D. Koller, R. Parr, and S. Venkataraman Efficient Solution Algorithms for Factored MDPs, Journal of Artificial Intelligence Research, volume 19, pages 399-468, 2003. (pdf)
  •  P. Poupart, R. Patrascu, D. Schuurmans, C. Boutilier, and C Guestrin Greedy Linear Value Function Approximation for Factored Markov Decision Processes. In Proceedings ofthe Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, AB, pages 285--291, 2002. (pdf)
  • C. Boutilier, R. Dearden, and M. Goldszmidt, Stochastic Dynamic Programming with Factored Representations, Artificial Intelligence, 121(1), pages 49--107, 2000. (link)


Hierarchical Reinforcement Learning Algorithms

  • (Options) Sutton, R. S.,Precup, D., Singh, S. (1999) Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. In Artificial Intelligence, vol. 112, pp.181-211 (pdf)
  •  (MAXQ) Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227-303 (link)
  • (HAMs) Ronald Parr and Stuart Russell (1997). Reinforcement Learning with Hierarchies of Machines. NIPS 97 (link)
  •  (A Survey) Andrew G. Barto, Sridhar Mahadevan (2003). Recent Advances in Hierarchical Reinforcement Learning. Special Issue on Reinforcement Learning, Discrete Event Systems journal, pp. 41-77, 2003. (pdf)

Comments (0)

You don't have permission to comment on this page.