
Earn a $50 Amazon gift card for testing a new product from the makers of PBworks. Click here to apply.

If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

Algorithms of Reinforcement Learning
Page history
last edited
by Satinder Singh 10 years, 10 months ago
The ambition of this page is to be a comprehensive collection of links to papers describing RL algorithms. In order to make this list manageable we should only consider RL algorithms that originated a class of algorithms and have been used/studied by at least one researcher(s) unaffiliated with the original inventor(s) of the algorithm.
Associative Reinforcement Learning Algorithms
 (A_{RP} )Barto, AG & Anandan, P (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360374.
ActorCritic
 (The foundational paper) Barto, A.G., Sutton, R.S., & Anderson, C. (1983). Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC13: 834846. (link)
Temporal Differences (TD)
 (The foundational paper) Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 944. (pdf)
 (Introduced Replace Traces) Singh S. & Sutton R. (1996). Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning journal, Volume 22, Issue 1, pages 123158. (link)
 Page with links to Theoretical Analyses of TD(lambda)
Qlearning
 (The foundational thesis) C. J. Watkins. Learning from Delayed Rewards. Phd thesis, Cambridge University, 1989.
 (Link to an online paper with description of Qlearning) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using RealTime Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81138. (link)
 Page with links to Theoretical Analyses of Qlearning
Dyna (shows that learning and planning are related and integral to RL)
 (Algorithm) R. S. Sutton (1991). Planning by incremental dynamic programming. Proceedings of the Eighth International Workshop on Machine Learning, pp. 353357, Morgan Kaufmann (link)
RealTime Dynamic Programming
 (Algorithm) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using RealTime Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81138. (link)
SARSA (onpolicy method for control)
 (Algorithm) Rummery G. and Niranjan, M. (1994). Online qlearning using connectionist systems,Tech. Rep. Technical Report CUED/FINFENG/TR 166, Cambridge, University Engineering Department. (link)
 Page with links to Theoretical Analyses of Sarsa
LSTD
 (A precursor) S. J., and Barto, A. G., Linear LeastSquares Algorithms for Temporal Difference Learning, Machine Learning, 22, 1996, pp. 3357.
 (Algorithm) Boyan, J. A., Technical Update: LeastSquares Temporal Difference Learning, In Machine Learning, 49, 2002. (An ICML version is here).
LSPI (Least Squares Policy Iteration)
 (Algorithm) Michail Lagoudakis and Ronald Parr, Least Squares Policy Iteration. Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 11071149. (pdf)
PartiGame
 (Algorithm) Moore A. and Atkeson C. (1995). The Partigame Algorithm for Variable Resolution Reinforcement Learning in Multidimensional Statespaces. In Machine Learning, Vol 21. (link)
Prioritized Sweeping
 (Algorithm) Moore A. and Atkeson C. (1993) Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time, In Machine Learning, Vol 13, pages 103130. (link)
E^3
 (Main Algorithm) Kearns M. and Singh S. (2002). NearOptimal Reinforcement Learning in Polynomial Time, In Machine Learning journal, Volume 49, Issue 2, pages 209232, 2002. (pdf)
 (For Factored MDPs) Kearns M. and Koller D. (1999). Efficient Reinforcement Learning in Factored MDPs. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pages 740747 (pdf)
 (In Metric Spaces) Kakade S., Kearns M. and Langford J. (2003). Exploration in Metric State Spaces. In ICML. (pdf)
SparseSampling
 (Algorithm) Kearns M., Mansour Y. and Ng A. (1999). A Sparse Sampling Algorithm for NearOptimal Planning in Large Markov Decision Processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Morgan Kaufmann, pages 13241331. Also in the journal Machine Learning. (pdf)
Trajectory Tree
 (Algorithm) Kearns M., Mansour Y. and Ng A. (2000). Approximate planning in large POMDPs via reusable trajectories. In NIPS 12, 2000. (pdf)
Policy Search Algorithms
 (PEGASUS) Ng A. and Jordan M. (2000). PEGASUS: A policy search method for large MDPs and POMDPs, In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference. (pdf)
Policy Gradient Algorithms
 (Reinforce) Williams, R. J. (1992). Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning, 8, 229256. (link)
 (Algorithm) Sutton R., McAllester D., Singh S. and Mansour Y (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12 (NIPS), 2000 (link).
 J. Baxter and P. L. Bartlett. Infinitehorizon gradientbased policy search. Journal of Artificial Intelligence Research, 15:319350, 2001.
 J. Baxter and P. L. Bartlett. GPOMDP: An online algorithm for estimating performance gradients in POMDP's, with applications. In Proceedings of the 2000 International Conference on Machine Learning, pages 4148, 2000.
 (*Natural Policy Gradient Algorithms*) S. Kakade, J. Bagnell, and J. Peters.
Linear Programming based RL methods
 D. P. de Farias and B. Van Roy, A Linear Program for Bellman Error Minimization with Performance Guarantees, In Mathematics of Operations Research. (pdf)
Algorithms for Structured MDPs/POMDPs
 C. Guestrin, M. Hauskrecht and B. Kveton. Solving Factored MDPs with Continuous and Discrete Variables. In the Twentieth Conference on Uncertainty in Artificial Intelligence, Banff, Canada, July 2004. (link)
 C. Guestrin, D. Koller, R. Parr, and S. Venkataraman Efficient Solution Algorithms for Factored MDPs, Journal of Artificial Intelligence Research, volume 19, pages 399468, 2003. (pdf)
 P. Poupart, R. Patrascu, D. Schuurmans, C. Boutilier, and C Guestrin Greedy Linear Value Function Approximation for Factored Markov Decision Processes. In Proceedings ofthe Eighteenth National Conference on Artificial Intelligence (AAAI2002), Edmonton, AB, pages 285291, 2002. (pdf)
 C. Boutilier, R. Dearden, and M. Goldszmidt, Stochastic Dynamic Programming with Factored Representations, Artificial Intelligence, 121(1), pages 49107, 2000. (link)
Hierarchical Reinforcement Learning Algorithms
 (Options) Sutton, R. S.,Precup, D., Singh, S. (1999) Between MDPs and semiMDPs: A Framework for Temporal Abstraction in Reinforcement Learning. In Artificial Intelligence, vol. 112, pp.181211 (pdf)
 (MAXQ) Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227303 (link)
 (HAMs) Ronald Parr and Stuart Russell (1997). Reinforcement Learning with Hierarchies of Machines. NIPS 97 (link)
 (A Survey) Andrew G. Barto, Sridhar Mahadevan (2003). Recent Advances in Hierarchical Reinforcement Learning. Special Issue on Reinforcement Learning, Discrete Event Systems journal, pp. 4177, 2003. (pdf)
Algorithms of Reinforcement Learning

Tip: To turn text into a link, highlight the text, then click on a page or file from the list above.





Comments (0)
You don't have permission to comment on this page.