
Algorithms of Reinforcement Learning

This version was saved by Satinder Singh on March 21, 2009 at 2:56:21 pm
 

The ambition of this page is to be a comprehensive collection of links to papers describing RL algorithms. To keep the list manageable, we only consider RL algorithms that originated a class of algorithms and have been used or studied by at least one researcher unaffiliated with the original inventor(s) of the algorithm.


Associative Reinforcement Learning Algorithms

  • (A_{R-P}) Barto, A.G. & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-374.

Actor-Critic

  • (The foundational paper) Barto, A.G., Sutton, R.S., & Anderson, C. (1983). Neuron-like adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC-13: 834-846. (link)
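Since the actor-critic architecture is easiest to see in code, here is a minimal, hypothetical sketch (the function name, the one-state bandit task, and the step sizes are all invented for illustration; the 1983 paper used neuron-like adaptive elements, not this modern softmax rendering). The critic tracks a value estimate, and its TD error both improves the critic and reinforces the actor:

```python
import math
import random

def actor_critic_bandit(pull, n_steps=2000, alpha_v=0.1, alpha_p=0.1, seed=0):
    """Minimal actor-critic on a one-state task: the critic tracks the average
    reward V; the actor is a softmax over preferences theta, updated in the
    direction of grad log pi scaled by the TD error delta = r - V, so the
    critic's estimate serves as a baseline."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]       # actor: preferences for the two arms
    V = 0.0                  # critic: running value estimate
    for _ in range(n_steps):
        z = [math.exp(t) for t in theta]
        probs = [x / sum(z) for x in z]
        a = 0 if rng.random() < probs[0] else 1
        r = pull(a, rng)
        delta = r - V                      # critic's TD error
        V += alpha_v * delta               # critic update
        for i in range(2):                 # actor update along grad log pi
            theta[i] += alpha_p * delta * ((1.0 if i == a else 0.0) - probs[i])
    z = [math.exp(t) for t in theta]
    return [x / sum(z) for x in z], V

# Deterministic bandit: arm 1 always pays 1, arm 0 pays nothing.
probs, V = actor_critic_bandit(lambda a, rng: float(a))
```

With these settings the actor's probability mass shifts to the rewarding arm while the critic's estimate approaches the average reward.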

 

Temporal Differences (TD)

  • (The foundational paper) Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44. (pdf)
  • (Introduced Replace Traces) Singh S. & Sutton R. (1996). Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning journal, Volume 22, Issue 1, pages 123-158. (link)
  • Page with links to Theoretical Analyses of TD(lambda)
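To make the TD idea concrete, here is a minimal tabular TD(0) prediction sketch (the helper name and the toy chain are illustrative assumptions, not code from the cited papers):

```python
def td0_evaluate(episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0) prediction: after each transition, move V(s) toward the
    bootstrapped target r + gamma * V(s'). A next_state of None marks a
    terminal state, whose value is zero."""
    V = {}
    for episode in episodes:
        for s, r, s_next in episode:        # (state, reward, next_state)
            v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
            td_error = r + gamma * v_next - V.get(s, 0.0)
            V[s] = V.get(s, 0.0) + alpha * td_error
    return V

# Toy two-state chain A -> B -> terminal, reward 1 on the final step:
# V(B) should approach 1 and V(A) should approach gamma * V(B) = 0.9.
V = td0_evaluate([[("A", 0.0, "B"), ("B", 1.0, None)]] * 200)
```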

 

Q-learning

  • (The foundational thesis) C. J. Watkins. Learning from Delayed Rewards. Phd thesis, Cambridge University, 1989.
  • (Link to an online paper with a description of Q-learning) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)
  • Page with links to Theoretical Analyses of Q-learning
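A minimal sketch of the tabular Q-learning backup from Watkins' thesis (the helper name and toy chain are invented here for illustration):

```python
def q_update(Q, s, a, r, s_next, next_actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning backup: Q(s,a) moves toward the off-policy
    target r + gamma * max_b Q(s',b). An empty next_actions list marks a
    terminal next state, making the bootstrap term zero."""
    q_sa = Q.get((s, a), 0.0)
    best_next = max((Q.get((s_next, b), 0.0) for b in next_actions), default=0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)

# Replaying a two-step chain drives Q toward the optimal values:
# Q(s2,go) -> 1 and Q(s1,go) -> gamma * 1 = 0.9.
Q = {}
for _ in range(50):
    q_update(Q, "s1", "go", 0.0, "s2", ["go"])
    q_update(Q, "s2", "go", 1.0, None, [])
```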

     

Dyna     (shows that learning and planning are related and integral to RL)

  • (Algorithm) R. S. Sutton (1991). Planning by incremental dynamic programming. Proceedings of the Eighth International Workshop on Machine Learning, pp. 353-357, Morgan Kaufmann (link)
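As a hedged sketch of the Dyna idea only (all names and the toy chain are invented, and the planning loop sweeps the learned model deterministically, whereas Sutton's Dyna samples stored state-action pairs at random):

```python
def dyna_sketch(real_transitions, n_planning=5, alpha=0.5, gamma=0.9):
    """Dyna-style integration of learning and planning: every real transition
    gets one Q-learning backup and is stored in a deterministic table model;
    planning then replays the whole model n_planning times as simulated
    experience, propagating value without further real interaction."""
    Q, model = {}, {}

    def backup(s, a, r, s_next, next_actions):
        q_sa = Q.get((s, a), 0.0)
        best = max((Q.get((s_next, b), 0.0) for b in next_actions), default=0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * best - q_sa)

    for s, a, r, s_next, next_actions in real_transitions:
        backup(s, a, r, s_next, next_actions)       # direct RL from real data
        model[(s, a)] = (r, s_next, next_actions)   # learn the table model
        for _ in range(n_planning):                 # plan with simulated data
            for (ms, ma), (mr, msn, mna) in list(model.items()):
                backup(ms, ma, mr, msn, mna)
    return Q

# One single pass over a two-step chain; planning alone propagates the final
# reward back to s1, which pure one-step Q-learning could not do in one pass.
Q = dyna_sketch([("s1", "go", 0.0, "s2", ["go"]), ("s2", "go", 1.0, None, [])])
```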

 

Real-Time Dynamic Programming

  • (Algorithm) Andrew Barto, Steve Bradtke and Satinder Singh (1995). Learning to Act using Real-Time Dynamic Programming. In Artificial Intelligence, Volume 72, pages 81-138. (link)

 

SARSA     (on-policy method for control)

  • (Algorithm) Rummery, G. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department. (link)
  • Page with links to Theoretical Analyses of Sarsa
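A sketch of the SARSA backup, highlighting the on-policy difference from Q-learning (the helper name and the toy replay loop are illustrative assumptions):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """One SARSA backup. Unlike Q-learning's max over next actions, the target
    uses the action a_next the policy actually takes in s' (on-policy).
    a_next=None marks a terminal next state, zeroing the bootstrap term."""
    q_sa = Q.get((s, a), 0.0)
    q_next = 0.0 if a_next is None else Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * q_next - q_sa)

# Replaying one fixed trajectory through a two-step chain; since the policy
# here always picks "go", SARSA and Q-learning agree on the fixed point.
Q = {}
for _ in range(50):
    sarsa_update(Q, "s1", "go", 0.0, "s2", "go")
    sarsa_update(Q, "s2", "go", 1.0, None, None)
```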

 

LSTD

  • (A precursor) Bradtke, S. J., and Barto, A. G. (1996). Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning, 22, pp. 33-57.
  • (Algorithm) Boyan, J. A., Technical Update: Least-Squares Temporal Difference Learning, In Machine Learning, 49, 2002. (An ICML version is here).

 

LSPI     (Least Squares Policy Iteration)

  • (Algorithm)  Michail Lagoudakis and Ronald Parr, Least Squares Policy Iteration. Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 1107-1149. (pdf)

 

Parti-Game

  • (Algorithm) Moore A. and Atkeson C. (1995). The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. In Machine Learning, Vol 21. (http://www.autonlab.org/autonweb/documents/papers/moore-partigame.ps)

 

Prioritized Sweeping

  • (Algorithm) Moore A. and Atkeson C. (1993). Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. In Machine Learning, Vol 13, pages 103-130. (http://www.autonlab.org/autonweb/documents/papers/mr:prrt.ps)

E^3

  • (Main Algorithm) Kearns M. and Singh S. (2002). Near-Optimal Reinforcement Learning in Polynomial Time. In Machine Learning journal, Volume 49, Issue 2, pages 209-232. (http://www.eecs.umich.edu/~baveja/Papers/MLjournal6.pdf)
  • (For Factored MDPs) Kearns M. and Koller D. (1999). Efficient Reinforcement Learning in Factored MDPs. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pages 740-747. (http://www.cis.upenn.edu/~mkearns/papers/dbne3.pdf)
  • (In Metric Spaces) Kakade S., Kearns M. and Langford J. (2003). Exploration in Metric State Spaces. In ICML. (http://www.cis.upenn.edu/~mkearns/papers/metricE3.pdf)

 

Sparse-Sampling (drives home the point that there is a space of algorithms with radical tradeoffs to be made)

  • (Algorithm) Kearns M., Mansour Y. and Ng A. (1999). A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pages 1324-1331. Also in the journal Machine Learning. (http://www.robotics.stanford.edu/~ang/papers/ijcai99-largemdp.pdf)

 

Trajectory Tree

  • (Algorithm) Kearns M., Mansour Y. and Ng A. (2000). Approximate planning in large POMDPs via reusable trajectories. In NIPS 12, 2000. (http://www.robotics.stanford.edu/~ang/papers/pomdp-long.pdf)

 

Policy Search Algorithms

  • (PEGASUS) Ng A. and Jordan M. (2000). PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference. (http://www.robotics.stanford.edu/~ang/papers/uai00-pegasus.pdf)

Policy Gradient Algorithms

  • (Reinforce) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256. (ftp://ftp.ccs.neu.edu/pub/people/rjw/conn-reinf-ml-92.ps)
  • (Algorithm) Sutton R., McAllester D., Singh S. and Mansour Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12 (NIPS), 2000. (http://www.eecs.umich.edu/~baveja/Papers/PolicyGradientNIPS99.ps.gz)
  • J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319-350, 2001.
  • J. Baxter and P. L. Bartlett. GPOMDP: An on-line algorithm for estimating performance gradients in POMDP's, with applications. In Proceedings of the 2000 International Conference on Machine Learning, pages 41-48, 2000.
  • (Natural Policy Gradient Algorithms) S. Kakade, J. Bagnell, and J. Peters.
      • Natural Policy Gradient (http://www.cis.upenn.edu/~skakade/papers/rl/natural.ps)
      • Covariant Reinforce (http://www.autonlab.org/autonweb/documents/papers/bagnellCovariant.pdf)
      • Natural Actor Critic (http://www-2.cs.cmu.edu/~nickr/nips_workshop/jpeters.abstract.pdf)
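To ground the policy-gradient idea, here is a minimal REINFORCE sketch on a two-armed bandit (all names, the softmax parameterization, and the toy bandit are illustrative assumptions, not code from the papers above):

```python
import math
import random

def reinforce_bandit(pull, n_episodes=2000, alpha=0.1, seed=0):
    """Williams-style REINFORCE with a softmax policy over preferences theta.
    For a softmax, d log pi(a) / d theta_i is (1 - pi(a)) when i == a and
    -pi(i) otherwise; each update scales that gradient by the reward, giving
    stochastic gradient ascent on expected reward."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(n_episodes):
        z = [math.exp(t) for t in theta]
        probs = [x / sum(z) for x in z]
        a = 0 if rng.random() < probs[0] else 1   # sample from the policy
        r = pull(a, rng)                          # one-step episode return
        for i in range(2):
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += alpha * r * grad_log
    z = [math.exp(t) for t in theta]
    return [x / sum(z) for x in z]

# Deterministic bandit: arm 1 always pays 1, arm 0 pays nothing; the policy
# should concentrate on arm 1.
probs = reinforce_bandit(lambda a, rng: float(a))
```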

 

 

Linear Programming based RL methods

  • D. P. de Farias and B. Van Roy. A Linear Program for Bellman Error Minimization with Performance Guarantees. Submitted to Mathematics of Operations Research. (http://www.stanford.edu/~bvr/psfiles/average-LP.pdf)

 

Algorithms for Structured MDPs/POMDPs

  • C. Guestrin, M. Hauskrecht and B. Kveton. Solving Factored MDPs with Continuous and Discrete Variables. In the Twentieth Conference on Uncertainty in Artificial Intelligence, Banff, Canada, July 2004. (http://research.microsoft.com/uai2004/)
  • C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient Solution Algorithms for Factored MDPs. Journal of Artificial Intelligence Research, volume 19, pages 399-468, 2003. (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume19/guestrin03a.pdf)
  • P. Poupart, R. Patrascu, D. Schuurmans, C. Boutilier, and C. Guestrin. Greedy Linear Value Function Approximation for Factored Markov Decision Processes. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, AB, pages 285-291, 2002. (http://www.cs.toronto.edu/~cebly/Papers/_download_/basislinear.pdf)
  • C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic Dynamic Programming with Factored Representations. Artificial Intelligence, 121(1), pages 49-107, 2000. (http://www.cs.toronto.edu/~cebly/Papers/sdp.ps)

 

Hierarchical Reinforcement Learning Algorithms

  • (Options) Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. In Artificial Intelligence, vol. 112, pp. 181-211. (Also see the list of papers on options at http://rlai.cs.ualberta.ca/RLAI/options.html)
  • (MAXQ) Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227-303. (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume13/dietterich00a.ps.Z)
  • (HAMs) Ronald Parr and Stuart Russell (1997). Reinforcement Learning with Hierarchies of Machines. NIPS 97. (http://www.cs.duke.edu/~parr/ham-nips97.ps.gz)
  • (Survey) Andrew G. Barto, Sridhar Mahadevan (2003). Recent Advances in Hierarchical Reinforcement Learning. Special Issue on Reinforcement Learning, Discrete Event Systems journal, pp. 41-77, 2003.
