
Bikramjit Banerjee's Publications


PALO Bounds for Reinforcement Learning in Partially Observable Stochastic Games

Roi Ceren, Keyang He, Prashant Doshi, and Bikramjit Banerjee. PALO Bounds for Reinforcement Learning in Partially Observable Stochastic Games. Neurocomputing, 420:36–56, Elsevier, 2021.

Download

[PDF] 

Abstract

A partially observable stochastic game (POSG) is a general model for multiagent decision making under uncertainty. Perkins’ Monte Carlo exploring starts for partially observable Markov decision process (POMDP) (MCES-P) integrates Monte Carlo exploring starts (MCES) into a local search of the policy space to offer an elegant template for model-free reinforcement learning in POSGs. However, multiagent reinforcement learning in POSGs is tremendously more complex than in single-agent settings due to the heterogeneity of agents and the discrepancy of their goals. In this article, we generalize reinforcement learning under partial observability to self-interested and cooperative multiagent settings under the POSG umbrella. We present three new templates for multiagent reinforcement learning in POSGs. MCES for interactive POMDP (MCES-IP) extends MCES-P by maintaining predictions of the other agent’s actions based on dynamic beliefs over models. MCES for multiagent POMDP (MCES-MP) generalizes MCES-P to the canonical multiagent POMDP framework, with a single policy mapping joint observations of all agents to joint actions. Finally, MCES for factored-reward multiagent POMDP (MCES-FMP) has each agent individually map joint observations to its own action. We use probabilistic approximate locally optimal (PALO) bounds to analyze sample complexity, thereby instantiating these templates to PALO learning. We promote sample efficiency with a policy-space pruning technique and evaluate the approaches on six benchmark domains against state-of-the-art techniques; the results demonstrate that MCES-IP and MCES-FMP yield improved policies with fewer samples than the previous baselines.
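To make the MCES-P-style template described above concrete, here is a minimal, illustrative Python sketch (not the paper's algorithm or its exact PALO bounds): a local search over deterministic observation-to-action policies that accepts a neighboring policy only when its Monte Carlo return estimate beats the current policy's by a margin backed by a simplified Hoeffding-style confidence radius. Everything below (the toy OBSERVATIONS/ACTIONS sets, the simulate function, and the constants V_RANGE, EPSILON, DELTA) is a hypothetical placeholder chosen for the sketch.

import math
import random

# Illustrative sketch only, in the spirit of MCES-P with PALO-style bounded
# policy comparisons; it is NOT the paper's implementation or exact bound.
OBSERVATIONS = ["o1", "o2"]   # toy observation set (hypothetical)
ACTIONS = ["a1", "a2"]        # toy action set (hypothetical)
V_RANGE = 1.5                 # assumed bound on the range of episode returns
EPSILON = 0.1                 # target local-optimality slack
DELTA = 0.1                   # allowed failure probability per comparison round

def simulate(policy):
    """Hypothetical environment: returns a noisy episode return for a policy."""
    base = sum(1.0 for o in OBSERVATIONS if policy[o] == "a1") / len(OBSERVATIONS)
    return base + random.uniform(-0.2, 0.2)

def hoeffding_radius(n, num_comparisons):
    """Simplified confidence radius after n samples, with a crude union bound
    over the neighbor comparisons (not the paper's PALO bound)."""
    return V_RANGE * math.sqrt(math.log(2 * num_comparisons / DELTA) / (2 * n))

def neighbors(policy):
    """All policies differing from `policy` in the action for one observation."""
    for o in OBSERVATIONS:
        for a in ACTIONS:
            if a != policy[o]:
                nbr = dict(policy)
                nbr[o] = a
                yield nbr

def mces_p_style_search(max_samples_per_policy=2000):
    policy = {o: random.choice(ACTIONS) for o in OBSERVATIONS}
    while True:
        cand = list(neighbors(policy))
        num_cmp = len(cand)
        est = {i: [] for i in range(num_cmp + 1)}   # index 0 = current policy
        improved = False
        for n in range(1, max_samples_per_policy + 1):
            est[0].append(simulate(policy))
            for i, nbr in enumerate(cand, start=1):
                est[i].append(simulate(nbr))
            rad = hoeffding_radius(n, num_cmp)
            mean0 = sum(est[0]) / n
            # Accept a neighbor once its estimated advantage is significant.
            for i, nbr in enumerate(cand, start=1):
                if sum(est[i]) / n - mean0 > EPSILON + 2 * rad:
                    policy, improved = nbr, True
                    break
            if improved or 2 * rad < EPSILON:
                break
        if not improved:
            return policy   # no neighbor looks epsilon-better at this confidence

if __name__ == "__main__":
    print(mces_p_style_search())

The design choice being illustrated is the one the abstract highlights: policy transformations are adopted only when enough samples have been gathered for the comparison to hold with high probability, which is what allows the number of samples per comparison to be analyzed and bounded.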

BibTeX

@Article{Ceren20:PALO,
  author =       {Roi Ceren and Keyang He and Prashant Doshi and Bikramjit Banerjee},
  title =        {PALO Bounds for Reinforcement Learning in Partially Observable
                  Stochastic Games},
  journal =      {Neurocomputing},
  year =         {2021},
  volume =       {420},
  pages =        {36--56},
  publisher =    {Elsevier},
  abstract =     {A partially observable stochastic game (POSG) is
  a general model for multiagent decision making under uncertainty.
  Perkins’ Monte Carlo exploring starts for partially observable
  Markov decision process (POMDP) (MCES-P) integrates Monte Carlo
  exploring starts (MCES) into a local search of the policy space to
  offer an elegant template for model-free reinforcement learning in
  POSGs. However, multiagent reinforcement learning in POSGs is
  tremendously more complex than in single-agent settings due to the
  heterogeneity of agents and the discrepancy of their goals. In this
  article, we generalize reinforcement learning under partial
  observability to self-interested and cooperative multiagent settings
  under the POSG umbrella. We present three new templates for multiagent
  reinforcement learning in POSGs. MCES for interactive POMDP (MCES-IP)
  extends MCES-P by maintaining predictions of the other agent’s actions
  based on dynamic beliefs over models. MCES for multiagent POMDP (MCES-MP)
  generalizes MCES-P to the canonical multiagent POMDP framework, with
  a single policy mapping joint observations of all agents to joint
  actions. Finally, MCES for factored-reward multiagent POMDP (MCES-FMP)
  has each agent individually map joint observations to its own action.
  We use probabilistic approximate locally optimal (PALO) bounds
  to analyze sample complexity, thereby instantiating these templates to
  PALO learning. We promote sample efficiency with a policy-space
  pruning technique and evaluate the approaches on six benchmark
  domains against state-of-the-art techniques; the results demonstrate
  that MCES-IP and MCES-FMP yield improved policies with fewer samples
  than the previous baselines.},
}

Generated by bib2html.pl (written by Patrick Riley) on Sat May 29, 2021 15:48:22