|
Algorithms of Reinforcement Learning
Page history
last edited
by Satinder Singh 15 years, 7 months ago
The ambition of this page is to be a comprehensive collection of links to papers describing RL algorithms. In order to make this list manageable we should only consider RL algorithms that originated a class of algorithms and have been used/studied by at least one researcher(s) unaffiliated with the original inventor(s) of the algorithm.
Associative Reinforcement Learning Algorithms
- (A_{R-P} )Barto, AG & Anandan, P (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-374.
Actor-Critic
- (The foundational paper) Barto, A.G., Sutton, R.S., & Anderson, C. (1983). Neuron-like adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC-13: 834-846. (link)
Temporal Differences (TD)
- (The foundational paper) Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44. (pdf)
- (Introduced Replace Traces) Singh S. & Sutton R. (1996). Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning journal, Volume 22, Issue 1, pages 123-158. (link)
- Page with links to Theoretical Analyses of TD(lambda)
Q-learning
- (The foundational thesis) C. J. Watkins. Learning from Delayed Rewards. Phd thesis, Cambridge University, 1989.
- (Link to an online paper with description of Q-learning) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)
- Page with links to Theoretical Analyses of Q-learning
Dyna (shows that learning and planning are related and integral to RL)
- (Algorithm) R. S. Sutton (1991). Planning by incremental dynamic programming. Proceedings of the Eighth International Workshop on Machine Learning, pp. 353-357, Morgan Kaufmann (link)
Real-Time Dynamic Programming
- (Algorithm) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)
SARSA (onpolicy method for control)
- (Algorithm) Rummery G. and Niranjan, M. (1994). On-line q-learning using connectionist systems,Tech. Rep. Technical Report CUED/F-INFENG/TR 166, Cambridge, University Engineering Department. (link)
- Page with links to Theoretical Analyses of Sarsa
LSTD
- (A precursor) S. J., and Barto, A. G., Linear Least-Squares Algorithms for Temporal Difference Learning, Machine Learning, 22, 1996, pp. 3357.
- (Algorithm) Boyan, J. A., Technical Update: Least-Squares Temporal Difference Learning, In Machine Learning, 49, 2002. (An ICML version is here).
LSPI (Least Squares Policy Iteration)
- (Algorithm) Michail Lagoudakis and Ronald Parr, Least Squares Policy Iteration. Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 1107-1149. (pdf)
Parti-Game
- (Algorithm) Moore A. and Atkeson C. (1995). The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. In Machine Learning, Vol 21. (link)
Prioritized Sweeping
- (Algorithm) Moore A. and Atkeson C. (1993) Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time, In Machine Learning, Vol 13, pages 103-130. (link)
E^3
- (Main Algorithm) Kearns M. and Singh S. (2002). Near-Optimal Reinforcement Learning in Polynomial Time, In Machine Learning journal, Volume 49, Issue 2, pages 209-232, 2002. (pdf)
- (For Factored MDPs) Kearns M. and Koller D. (1999). Efficient Reinforcement Learning in Factored MDPs. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pages 740--747 (pdf)
- (In Metric Spaces) Kakade S., Kearns M. and Langford J. (2003). Exploration in Metric State Spaces. In ICML. (pdf)
Sparse-Sampling
- (Algorithm) Kearns M., Mansour Y. and Ng A. (1999). A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Morgan Kaufmann, pages 1324--1331. Also in the journal Machine Learning. (pdf)
Trajectory Tree
- (Algorithm) Kearns M., Mansour Y. and Ng A. (2000). Approximate planning in large POMDPs via reusable trajectories. In NIPS 12, 2000. (pdf)
Policy Search Algorithms
- (PEGASUS) Ng A. and Jordan M. (2000). PEGASUS: A policy search method for large MDPs and POMDPs, In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference. (pdf)
Policy Gradient Algorithms
- (Reinforce) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256. (link)
- (Algorithm) Sutton R., McAllester D., Singh S. and Mansour Y (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12 (NIPS), 2000 (link).
- J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319-350, 2001.
- J. Baxter and P. L. Bartlett. GPOMDP: An on-line algorithm for estimating performance gradients in POMDP's, with applications. In Proceedings of the 2000 International Conference on Machine Learning, pages 41-48, 2000.
- (*Natural Policy Gradient Algorithms*) S. Kakade, J. Bagnell, and J. Peters.
Linear Programming based RL methods
- D. P. de Farias and B. Van Roy, A Linear Program for Bellman Error Minimization with Performance Guarantees, In Mathematics of Operations Research. (pdf)
Algorithms for Structured MDPs/POMDPs
- C. Guestrin, M. Hauskrecht and B. Kveton. Solving Factored MDPs with Continuous and Discrete Variables. In the Twentieth Conference on Uncertainty in Artificial Intelligence, Banff, Canada, July 2004. (link)
- C. Guestrin, D. Koller, R. Parr, and S. Venkataraman Efficient Solution Algorithms for Factored MDPs, Journal of Artificial Intelligence Research, volume 19, pages 399-468, 2003. (pdf)
- P. Poupart, R. Patrascu, D. Schuurmans, C. Boutilier, and C Guestrin Greedy Linear Value Function Approximation for Factored Markov Decision Processes. In Proceedings ofthe Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, AB, pages 285--291, 2002. (pdf)
- C. Boutilier, R. Dearden, and M. Goldszmidt, Stochastic Dynamic Programming with Factored Representations, Artificial Intelligence, 121(1), pages 49--107, 2000. (link)
Hierarchical Reinforcement Learning Algorithms
- (Options) Sutton, R. S.,Precup, D., Singh, S. (1999) Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. In Artificial Intelligence, vol. 112, pp.181-211 (pdf)
- (MAXQ) Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227-303 (link)
- (HAMs) Ronald Parr and Stuart Russell (1997). Reinforcement Learning with Hierarchies of Machines. NIPS 97 (link)
- (A Survey) Andrew G. Barto, Sridhar Mahadevan (2003). Recent Advances in Hierarchical Reinforcement Learning. Special Issue on Reinforcement Learning, Discrete Event Systems journal, pp. 41-77, 2003. (pdf)
Algorithms of Reinforcement Learning
|
Tip: To turn text into a link, highlight the text, then click on a page or file from the list above.
|
|
|
|
|
Comments (0)
You don't have permission to comment on this page.