The ambition of this page is to be a comprehensive collection of links to papers describing RL algorithms. To keep the list manageable, we only include algorithms that originated a class of algorithms and have been used or studied by at least one researcher unaffiliated with the original inventors of the algorithm.
Associative Reinforcement Learning Algorithms
- (A_{R-P}) Barto, A. G. & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-374.
Actor-Critic
- (The foundational paper) Barto, A.G., Sutton, R.S., & Anderson, C. (1983). Neuron-like adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC-13: 834-846. (link)
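A present-day tabular sketch of the actor-critic idea (a TD-error critic driving preference updates for a softmax actor), not the ASE/ACE implementation of the 1983 paper. The env.reset()/env.step() interface and all parameter names are assumptions made for illustration.
<verbatim>
import numpy as np

def actor_critic_episode(env, V, prefs, alpha_v=0.1, alpha_p=0.1, gamma=0.99):
    """Run one episode of a one-step tabular actor-critic (sketch).

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done),
    with integer states/actions; V is a 1-D value array (the critic) and
    prefs a (num_states, num_actions) array of action preferences (the actor).
    """
    s = env.reset()
    done = False
    while not done:
        # Softmax policy over the actor's action preferences.
        p = np.exp(prefs[s] - prefs[s].max())
        p /= p.sum()
        a = np.random.choice(len(p), p=p)
        s_next, r, done = env.step(a)
        # TD error computed by the critic.
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]
        V[s] += alpha_v * delta          # critic update
        prefs[s, a] += alpha_p * delta   # actor update: reinforce the chosen action
        s = s_next
</verbatim>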
Temporal Differences (TD)
- (The foundational paper) Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44. (pdf)
- (Introduced Replace Traces) Singh S. & Sutton R. (1996). Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning journal, Volume 22, Issue 1, pages 123-158. (link)
- Page with links to Theoretical Analyses of TD(lambda)
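A minimal sketch of tabular TD(lambda) policy evaluation, including the replacing-trace option introduced by Singh & Sutton (1996). The env.reset()/env.step() interface, the policy callable, and the parameter names are assumptions made for illustration.
<verbatim>
import numpy as np

def td_lambda(env, policy, num_states, episodes=100,
              alpha=0.1, gamma=0.99, lam=0.9, replacing=True):
    """Estimate V for a fixed policy with TD(lambda) and eligibility traces.

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done),
    with integer states, and policy(state) -> action.
    """
    V = np.zeros(num_states)
    for _ in range(episodes):
        e = np.zeros(num_states)            # eligibility traces
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e *= gamma * lam                # decay all traces
            if replacing:
                e[s] = 1.0                  # replacing trace (Singh & Sutton 1996)
            else:
                e[s] += 1.0                 # accumulating trace
            V += alpha * delta * e          # every traced state shares the TD error
            s = s_next
    return V
</verbatim>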
Q-learning
- (The foundational thesis) C. J. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.
- (An online paper with a description of Q-learning) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)
- Page with links to Theoretical Analyses of Q-learning
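A minimal sketch of Watkins' tabular Q-learning with an epsilon-greedy behaviour policy. The environment interface and parameter names are assumptions made for illustration.
<verbatim>
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy tabular Q-learning (sketch).

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done).
    """
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy.
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target: max over next actions, regardless of behaviour.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
</verbatim>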
Dyna (shows that learning and planning are related and integral to RL)
- (Algorithm) R. S. Sutton (1991). Planning by incremental dynamic programming. Proceedings of the Eighth International Workshop on Machine Learning, pp. 353-357, Morgan Kaufmann (link)
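A minimal sketch of the Dyna idea in its commonly presented Dyna-Q form: each real transition feeds both a direct Q-learning update and a learned model, and extra planning updates are replayed from the model. The deterministic model, environment interface, and parameter names are assumptions made for illustration.
<verbatim>
import random
import numpy as np

def dyna_q(env, num_states, num_actions, episodes=200,
           planning_steps=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q sketch: learning and planning share the same update rule.

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done).
    """
    Q = np.zeros((num_states, num_actions))
    model = {}                                   # (s, a) -> (r, s_next, done)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Direct RL: one Q-learning update from real experience.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            # Model learning: remember the last observed outcome.
            model[(s, a)] = (r, s_next, done)
            # Planning: replay simulated transitions drawn from the model.
            for _ in range(planning_steps):
                (ps, pa), (pr, pn, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * np.max(Q[pn])
                Q[ps, pa] += alpha * (ptarget - Q[ps, pa])
            s = s_next
    return Q
</verbatim>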
Real-Time Dynamic Programming
- (Algorithm) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)
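A minimal sketch of a reward-maximizing RTDP variant for a known model: Bellman backups are applied only to the states actually visited along greedy trajectories from the start state. The array-based model representation (P[a], R[a]) and parameter names are assumptions made for illustration.
<verbatim>
import numpy as np

def rtdp(P, R, start_state, goal_states, gamma=1.0, trials=100, max_steps=1000):
    """Real-Time Dynamic Programming sketch.

    P[a] is a (num_states, num_states) transition matrix for action a and
    R[a] the corresponding expected-reward vector; goal_states is a set of
    absorbing states at which a trial ends.
    """
    num_actions = len(P)
    num_states = P[0].shape[0]
    V = np.zeros(num_states)
    for _ in range(trials):
        s = start_state
        for _ in range(max_steps):
            if s in goal_states:
                break
            # Bellman backup at the current state only.
            q = np.array([R[a][s] + gamma * P[a][s] @ V for a in range(num_actions)])
            V[s] = q.max()
            # Follow the greedy action to choose the next state to back up.
            a = int(np.argmax(q))
            s = np.random.choice(num_states, p=P[a][s])
    return V
</verbatim>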
SARSA (on-policy method for control)
- (Algorithm) Rummery, G. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department. (link)
- Page with links to Theoretical Analyses of Sarsa
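A minimal sketch of tabular SARSA; unlike Q-learning, the on-policy target uses the action actually chosen in the next state. The environment interface and parameter names are assumptions made for illustration.
<verbatim>
import numpy as np

def sarsa(env, num_states, num_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy tabular SARSA (sketch).

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done).
    """
    Q = np.zeros((num_states, num_actions))

    def choose(s):
        # Epsilon-greedy action selection from the current Q estimate.
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = choose(s_next)
            # On-policy target: bootstrap from the action actually selected next.
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
</verbatim>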
LSTD
- (A precursor) Bradtke, S. J. and Barto, A. G., Linear Least-Squares Algorithms for Temporal Difference Learning, Machine Learning, 22, 1996, pp. 33-57.
- (Algorithm) Boyan, J. A., Technical Update: Least-Squares Temporal Difference Learning, In Machine Learning, 49, 2002. (An ICML version is here).
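A minimal sketch of batch LSTD(0), the lambda = 0 special case of the algorithms above: accumulate the matrix A and vector b over a batch of transitions collected under the evaluated policy, then solve A w = b for the value-function weights. The transition format, feature function phi, and ridge term are assumptions made for illustration.
<verbatim>
import numpy as np

def lstd(transitions, phi, num_features, gamma=0.99, reg=1e-6):
    """LSTD(0) sketch: least-squares fixed point of the TD update.

    transitions is an iterable of (s, r, s_next, done) tuples and phi(s)
    returns a length-num_features feature vector. The small ridge term
    keeps A invertible when data are scarce.
    """
    A = reg * np.eye(num_features)
    b = np.zeros(num_features)
    for s, r, s_next, done in transitions:
        f = phi(s)
        f_next = np.zeros(num_features) if done else phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    w = np.linalg.solve(A, b)
    return w   # estimated value function is V(s) ~ phi(s) . w
</verbatim>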
LSPI (Least Squares Policy Iteration)
- (Algorithm) Michail Lagoudakis and Ronald Parr, Least Squares Policy Iteration. Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 1107-1149. (pdf)
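A minimal sketch of LSPI: repeated LSTDQ solves on one fixed batch of samples, each evaluating the policy that is greedy with respect to the previous weight vector. The sample format and the state-action feature function phi_sa are assumptions made for illustration.
<verbatim>
import numpy as np

def lstdq(samples, phi_sa, num_features, w, num_actions, gamma=0.99, reg=1e-6):
    """Evaluate the policy greedy w.r.t. w by solving a least-squares system."""
    A = reg * np.eye(num_features)
    b = np.zeros(num_features)
    for s, a, r, s_next, done in samples:
        f = phi_sa(s, a)
        if done:
            f_next = np.zeros(num_features)
        else:
            # Greedy action of the current policy at the next state.
            a_next = max(range(num_actions), key=lambda u: phi_sa(s_next, u) @ w)
            f_next = phi_sa(s_next, a_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi_sa, num_features, num_actions,
         gamma=0.99, iterations=20, tol=1e-4):
    """LSPI sketch: iterate LSTDQ until the weights stop changing."""
    w = np.zeros(num_features)
    for _ in range(iterations):
        w_new = lstdq(samples, phi_sa, num_features, w, num_actions, gamma)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
</verbatim>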
Parti-Game
1 (Algorithm) Moore A. and Atkeson C. (1995). [[http://www.autonlab.org/autonweb/documents/papers/moore-partigame.ps][The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces]]. In Machine Learning, Vol 21.
Prioritized Sweeping
1 (Algorithm) Moore A. and Atkeson C. (1993). [[http://www.autonlab.org/autonweb/documents/papers/mr:prrt.ps][Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time]]. In Machine Learning, Vol 13, pages 103-130.
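A minimal sketch of the prioritized-sweeping planning step in its commonly presented Q-value form with a deterministic learned model (the paper above treats a richer stochastic model): backups are popped from a priority queue ordered by the magnitude of the Bellman error, and predecessors of updated states are pushed back on. The data structures and parameter names are assumptions made for illustration.
<verbatim>
import heapq
import numpy as np

def prioritized_sweeping_step(Q, model, predecessors, s, a,
                              alpha=0.5, gamma=0.95, theta=1e-4, n_updates=20):
    """Planning after observing (s, a), given a learned deterministic model.

    model[(s, a)] holds (r, s_next); predecessors[s] is a set of
    (s_prev, a_prev) pairs known to lead to s.
    """
    def priority(s_, a_):
        r, s_next = model[(s_, a_)]
        return abs(r + gamma * np.max(Q[s_next]) - Q[s_, a_])

    pq = []                                  # entries are (-priority, state, action)
    p = priority(s, a)
    if p > theta:
        heapq.heappush(pq, (-p, s, a))
    for _ in range(n_updates):
        if not pq:
            break
        _, s_, a_ = heapq.heappop(pq)        # largest Bellman error first
        r, s_next = model[(s_, a_)]
        Q[s_, a_] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s_, a_])
        # States leading into s_ may now have stale values: queue them too.
        for s_prev, a_prev in predecessors.get(s_, ()):
            p_prev = priority(s_prev, a_prev)
            if p_prev > theta:
                heapq.heappush(pq, (-p_prev, s_prev, a_prev))
    return Q
</verbatim>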
E^3
1 (Main Algorithm) Kearns M. and Singh S. (2002). [[http://www.eecs.umich.edu/~baveja/Papers/MLjournal6.pdf][Near-Optimal Reinforcement Learning in Polynomial Time]]. In Machine Learning journal, Volume 49, Issue 2, pages 209-232, 2002.
1 (For Factored MDPs) Kearns M. and Koller D. (1999). [[http://www.cis.upenn.edu/~mkearns/papers/dbne3.pdf][Efficient Reinforcement Learning in Factored MDPs]]. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pages 740--747.
1 (In Metric Spaces) Kakade S., Kearns M. and Langford J. (2003). [[http://www.cis.upenn.edu/~mkearns/papers/metricE3.pdf][Exploration in Metric State Spaces]]. In ICML.
Sparse-Sampling (drives home the point that there is a space of algorithms with radical tradeoffs to be made)
1 (Algorithm) Kearns M., Mansour Y. and Ng A. (1999). [[http://www.robotics.stanford.edu/~ang/papers/ijcai99-largemdp.pdf][A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes]]. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pages 1324--1331. Also in the journal Machine Learning.
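A minimal sketch of the sparse-sampling planner: Q-values at the current state are estimated by recursively sampling a fixed number of next states per action from a generative model, down to a fixed depth, so the cost per decision is independent of the number of states. The generative-model interface sim(s, a) and parameter names are assumptions made for illustration.
<verbatim>
import numpy as np

def sparse_sampling_q(sim, s, depth, num_actions, width, gamma=0.95):
    """Estimate Q(s, .) by building a sampled lookahead tree of the given depth.

    sim(s, a) -> (s_next, reward) is a generative model; width is the number
    of sampled successors per action. Runtime grows as
    (num_actions * width) ** depth but does not depend on the state-space size.
    """
    if depth == 0:
        return np.zeros(num_actions)
    q = np.zeros(num_actions)
    for a in range(num_actions):
        total = 0.0
        for _ in range(width):
            s_next, r = sim(s, a)
            v_next = np.max(sparse_sampling_q(sim, s_next, depth - 1,
                                              num_actions, width, gamma))
            total += r + gamma * v_next
        q[a] = total / width
    return q     # act greedily with respect to these estimates
</verbatim>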
Trajectory Tree
1 (Algorithm) Kearns M., Mansour Y. and Ng A. (2000). [[http://www.robotics.stanford.edu/~ang/papers/pomdp-long.pdf][Approximate planning in large POMDPs via reusable trajectories]]. In NIPS 12, 2000.
Policy Search Algorithms
1 (*PEGASUS*) Ng A. and Jordan M. (2000). [[http://www.robotics.stanford.edu/~ang/papers/uai00-pegasus.pdf][PEGASUS: A policy search method for large MDPs and POMDPs]]. In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference.
Policy Gradient Algorithms
1 (Reinforce) Williams, R. J. (1992). [[ftp://ftp.ccs.neu.edu/pub/people/rjw/conn-reinf-ml-92.ps][Simple statistical gradient-following algorithms for connectionist reinforcement learning]]. Machine Learning, 8, 229-256. (A minimal REINFORCE sketch appears at the end of this section.)
1 (Algorithm) Sutton R., <nop>McAllester D., Singh S. and Mansour Y. (2000). [[http://www.eecs.umich.edu/~baveja/Papers/PolicyGradientNIPS99.ps.gz][Policy Gradient Methods for Reinforcement Learning with Function Approximation]]. In Advances in Neural Information Processing Systems 12 (NIPS), 2000.
1 J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319-350, 2001.
1 J. Baxter and P. L. Bartlett. GPOMDP: An on-line algorithm for estimating performance gradients in POMDP's, with applications. In Proceedings of the 2000 International Conference on Machine Learning, pages 41-48, 2000.
1 (*Natural Policy Gradient Algorithms*) S. Kakade, J. Bagnell, and J. Peters.
* [[http://www.cis.upenn.edu/~skakade/papers/rl/natural.ps][Natural Policy Gradient]]
* [[http://www.autonlab.org/autonweb/documents/papers/bagnellCovariant.pdf][Covariant Reinforce]]
* [[http://www-2.cs.cmu.edu/~nickr/nips_workshop/jpeters.abstract.pdf][Natural Actor Critic]]
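A minimal sketch of episodic REINFORCE with a tabular softmax policy and no baseline: after each episode the action preferences are moved along grad log pi(a_t | s_t) scaled by the return G_t. The environment interface and parameter names are assumptions made for illustration.
<verbatim>
import numpy as np

def reinforce(env, num_states, num_actions, episodes=1000, alpha=0.01, gamma=0.99):
    """Monte-Carlo policy-gradient (REINFORCE) sketch.

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done).
    theta[s, a] are action preferences; pi(a | s) = softmax(theta[s]).
    """
    theta = np.zeros((num_states, num_actions))

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    for _ in range(episodes):
        # Roll out one episode under the current policy.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=policy(s))
            s_next, r, done = env.step(a)
            states.append(s)
            actions.append(a)
            rewards.append(r)
            s = s_next
        # Compute returns G_t backwards and ascend the log-likelihood gradient.
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            s_t, a_t = states[t], actions[t]
            grad_log = -policy(s_t)
            grad_log[a_t] += 1.0          # gradient of log softmax w.r.t. theta[s_t]
            theta[s_t] += alpha * G * grad_log
    return theta
</verbatim>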
Linear Programming based RL methods
1 D. P. de Farias and B. Van Roy, [[http://www.stanford.edu/~bvr/psfiles/average-LP.pdf][A Linear Program for Bellman Error Minimization with Performance Guarantees]], submitted to Mathematics of Operations Research.
Algorithms for Structured MDPs/POMDPs
1 C. Guestrin, M. Hauskrecht and B. Kveton. [[http://research.microsoft.com/uai2004/][Solving Factored MDPs with Continuous and Discrete Variables]]. In the Twentieth Conference on Uncertainty in Artificial Intelligence, Banff, Canada, July 2004.
2 C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. [[http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume19/guestrin03a.pdf][Efficient Solution Algorithms for Factored MDPs]], Journal of Artificial Intelligence Research, volume 19, pages 399-468, 2003.
3 P. Poupart, R. Patrascu, D. Schuurmans, C. Boutilier, and C. Guestrin. [[http://www.cs.toronto.edu/~cebly/Papers/_download_/basislinear.pdf][Greedy Linear Value Function Approximation for Factored Markov Decision Processes]]. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, AB, pages 285--291, 2002.
4 C. Boutilier, R. Dearden, and M. Goldszmidt, [[http://www.cs.toronto.edu/~cebly/Papers/sdp.ps][Stochastic Dynamic Programming with Factored Representations]], Artificial Intelligence, 121(1), pages 49--107, 2000.
Hierarchical Reinforcement Learning Algorithms
1 (_Options_) Sutton, R. S., Precup, D., Singh, S. (1999). [[%ATTACHURL%/Sutton-Precup-Singh-AIJ99.pdf][Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning]]. In Artificial Intelligence, vol. 112, pp. 181-211. (Also see [[http://rlai.cs.ualberta.ca/RLAI/options.html][list of papers on options]])
1 (_MAXQ_) Dietterich, T. G. (2000). [[http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume13/dietterich00a.ps.Z][Hierarchical reinforcement learning with the MAXQ value function decomposition]]. Journal of Artificial Intelligence Research, 13, 227-303.
1 (_HAMs_) Ronald Parr and Stuart Russell (1997). [[http://www.cs.duke.edu/~parr/ham-nips97.ps.gz][Reinforcement Learning with Hierarchies of Machines]]. NIPS 97.
1 (_Survey_) Andrew G. Barto, Sridhar Mahadevan (2003). [[%ATTACHURL%/barto03recent.pdf][Recent Advances in Hierarchical Reinforcement Learning]]. Special Issue on Reinforcement Learning, Discrete Event Systems journal, pp. 41-77, 2003.