Bikramjit Banerjee's Publications


On-Policy Concurrent Reinforcement Learning

Bikramjit Banerjee, Sandip Sen, and Jing Peng. On-Policy Concurrent Reinforcement Learning. Journal of Experimental and Theoretical Artificial Intelligence, 16(4):245–260, 2004.

Abstract

When an agent learns in a multi-agent environment, the payoff it receives depends on the behaviour of the other agents. If the other agents are also learning, its reward distribution becomes non-stationary. This makes learning in multi-agent systems more difficult than single-agent learning. Prior attempts at value-function-based learning in such domains have used off-policy Q-learning, which does not scale well, as the cornerstone, with limited success. This paper studies on-policy modifications of such algorithms, with the promise of scalability and efficiency. In particular, it is proven that these hybrid techniques are guaranteed to converge to their desired fixed points under some restrictions. It is also shown experimentally that the new techniques can learn (from self-play) better policies than the previous algorithms (also in self-play) during some phases of the exploration.
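
As a rough illustration of the off-policy versus on-policy distinction mentioned in the abstract, the following single-agent sketch contrasts the standard Q-learning and SARSA updates. It is not the paper's hybrid multi-agent algorithms; the function names, the dictionary-based Q-table, and the default `alpha` and `gamma` values are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical tabular representation: (state, action) -> estimated value.
QTable = Dict[Tuple[int, int], float]


def q_learning_update(q: QTable, s: int, a: int, r: float, s_next: int,
                      actions: Iterable[int],
                      alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Off-policy update: bootstrap from the greedy action in s_next,
    regardless of which action the agent actually takes there."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)


def sarsa_update(q: QTable, s: int, a: int, r: float, s_next: int,
                 a_next: int,
                 alpha: float = 0.1, gamma: float = 0.9) -> None:
    """On-policy update: bootstrap from the action a_next that the
    agent's current (exploring) policy actually selected in s_next."""
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * q.get((s_next, a_next), 0.0) - old)


if __name__ == "__main__":
    # Same transition, different bootstrap targets.
    base = {(1, 0): 1.0, (1, 1): 5.0}
    q_off, q_on = dict(base), dict(base)
    q_learning_update(q_off, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
    sarsa_update(q_on, s=0, a=0, r=1.0, s_next=1, a_next=0)
    print(q_off[(0, 0)], q_on[(0, 0)])
```

The two updates differ only in the bootstrap term: Q-learning uses the maximum over next actions, while SARSA uses the value of the next action actually chosen.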

BibTeX

@Article{Banerjee04:On-Policy,
  author = 	 {Bikramjit Banerjee and Sandip Sen and Jing Peng},
  title = 	 {On-Policy Concurrent Reinforcement Learning},
  journal = 	 {Journal of Experimental and Theoretical Artificial
                 Intelligence},
  year = 	 {2004},
  volume = 	 {16},
  number = 	 {4},
  pages = 	 {245--260},
  abstract = {When an agent learns in a multi-agent environment, the
   payoff it receives depends on the behaviour of the other agents.
   If the other agents are also learning, its reward distribution becomes
   non-stationary. This makes learning in multi-agent systems more
   difficult than single-agent learning. Prior attempts at
   value-function-based learning in such domains have used off-policy
   Q-learning, which does not scale well, as the cornerstone, with limited
   success. This paper studies on-policy modifications of such algorithms,
   with the promise of scalability and efficiency. In particular, it is
   proven that these hybrid techniques are guaranteed to converge to their
   desired fixed points under some restrictions. It is also shown
   experimentally that the new techniques can learn (from self-play)
   better policies than the previous algorithms (also in self-play) during
   some phases of the exploration.},
}
