The ambition of this page is to be a comprehensive collection of links to papers describing RL algorithms. In order to make this list manageable we should only consider RL algorithms that originated a class of algorithms and have been used/studied by at least one researcher(s) unaffiliated with the original inventor(s) of the algorithm.

Associative Reinforcement Learning Algorithms

(A_{R-P} )Barto, AG & Anandan, P (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-374.

Actor-Critic

(The foundational paper) Barto, A.G., Sutton, R.S., & Anderson, C. (1983). Neuron-like adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC-13: 834-846. (link)

Temporal Differences (TD)

(The foundational paper) Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44. (pdf)
(Introduced Replace Traces) Singh S. & Sutton R. (1996). Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning journal, Volume 22, Issue 1, pages 123-158. (link)
Page with links to Theoretical Analyses of TD(lambda)

Q-learning

(The foundational thesis) C. J. Watkins. Learning from Delayed Rewards. Phd thesis, Cambridge University, 1989.
(Link to an online paper with description of Q-learning) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)
Page with links to Theoretical Analyses of Q-learning

Dyna (shows that learning and planning are related and integral to RL)

(Algorithm) R. S. Sutton (1991). Planning by incremental dynamic programming. Proceedings of the Eighth International Workshop on Machine Learning, pp. 353-357, Morgan Kaufmann (link)

Real-Time Dynamic Programming

(Algorithm) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)

SARSA (onpolicy method for control)

(Algorithm) Rummery G. and Niranjan, M. (1994). On-line q-learning using connectionist systems,Tech. Rep. Technical Report CUED/F-INFENG/TR 166, Cambridge, University Engineering Department. (link)
Page with links to Theoretical Analyses of Sarsa

LSTD

(A precursor) S. J., and Barto, A. G., Linear Least-Squares Algorithms for Temporal Difference Learning, Machine Learning, 22, 1996, pp. 3357.
(Algorithm) Boyan, J. A., Technical Update: Least-Squares Temporal Difference Learning, In Machine Learning, 49, 2002. (An ICML version is here).

LSPI (Least Squares Policy Iteration)

(Algorithm) Michail Lagoudakis and Ronald Parr, Least Squares Policy Iteration. Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 1107-1149. (pdf)

Parti-Game

(Algorithm) Moore A. and Atkeson C. (1995). The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. In Machine Learning, Vol 21. (link)

Prioritized Sweeping

(Algorithm) Moore A. and Atkeson C. (1993) Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time, In Machine Learning, Vol 13, pages 103-130. (link)

E^3

(Main Algorithm) Kearns M. and Singh S. (2002). Near-Optimal Reinforcement Learning in Polynomial Time, In Machine Learning journal, Volume 49, Issue 2, pages 209-232, 2002. (pdf)
(For Factored MDPs) Kearns M. and Koller D. (1999). Efficient Reinforcement Learning in Factored MDPs. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pages 740--747 (pdf)
(In Metric Spaces) Kakade S., Kearns M. and Langford J. (2003). Exploration in Metric State Spaces. In ICML. (pdf)

Sparse-Sampling

(Algorithm) Kearns M., Mansour Y. and Ng A. (1999). A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Morgan Kaufmann, pages 1324--1331. Also in the journal Machine Learning. (pdf)

Trajectory Tree

(Algorithm) Kearns M., Mansour Y. and Ng A. (2000). Approximate planning in large POMDPs via reusable trajectories. In NIPS 12, 2000. (pdf)

Policy Search Algorithms

(PEGASUS) Ng A. and Jordan M. (2000). PEGASUS: A policy search method for large MDPs and POMDPs, In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference. (pdf)

Policy Gradient Algorithms

(Reinforce) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256. (link)
(Algorithm) Sutton R., McAllester D., Singh S. and Mansour Y (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12 (NIPS), 2000 (link).
J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319-350, 2001.
J. Baxter and P. L. Bartlett. GPOMDP: An on-line algorithm for estimating performance gradients in POMDP's, with applications. In Proceedings of the 2000 International Conference on Machine Learning, pages 41-48, 2000.
(*Natural Policy Gradient Algorithms*) S. Kakade, J. Bagnell, and J. Peters.

Linear Programming based RL methods

D. P. de Farias and B. Van Roy, A Linear Program for Bellman Error Minimization with Performance Guarantees, In Mathematics of Operations Research. (pdf)

Algorithms for Structured MDPs/POMDPs

C. Guestrin, M. Hauskrecht and B. Kveton. Solving Factored MDPs with Continuous and Discrete Variables. In the Twentieth Conference on Uncertainty in Artificial Intelligence, Banff, Canada, July 2004. (link)
C. Guestrin, D. Koller, R. Parr, and S. Venkataraman Efficient Solution Algorithms for Factored MDPs, Journal of Artificial Intelligence Research, volume 19, pages 399-468, 2003. (pdf)
P. Poupart, R. Patrascu, D. Schuurmans, C. Boutilier, and C Guestrin Greedy Linear Value Function Approximation for Factored Markov Decision Processes. In Proceedings ofthe Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, AB, pages 285--291, 2002. (pdf)
C. Boutilier, R. Dearden, and M. Goldszmidt, Stochastic Dynamic Programming with Factored Representations, Artificial Intelligence, 121(1), pages 49--107, 2000. (link)

Hierarchical Reinforcement Learning Algorithms

(Options) Sutton, R. S.,Precup, D., Singh, S. (1999) Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. In Artificial Intelligence, vol. 112, pp.181-211 (pdf)
- Also see list of papers on options
(MAXQ) Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227-303 (link)
(HAMs) Ronald Parr and Stuart Russell (1997). Reinforcement Learning with Hierarchies of Machines. NIPS 97 (link)
(A Survey) Andrew G. Barto, Sridhar Mahadevan (2003). Recent Advances in Hierarchical Reinforcement Learning. Special Issue on Reinforcement Learning, Discrete Event Systems journal, pp. 41-77, 2003. (pdf)

Algorithms of Reinforcement Learning