Overview
    
Model-free reinforcement learning (RL), and in particular \(Q\)-learning, is widely used to learn optimal policies for a variety of planning and control problems. However, when the underlying state-transition dynamics are stochastic and high-dimensional, \(Q\)-learning requires a large amount of data and incurs a prohibitively high computational cost. In this work, we introduce Hamiltonian \(Q\)-Learning, a data-efficient modification of the \(Q\)-learning approach that adopts an importance-sampling-based technique for computing the \(Q\) function. To exploit the stochastic structure of the state-transition dynamics, we employ Hamiltonian Monte Carlo (HMC) to update \(Q\) function estimates, approximating the expected future reward using \(Q\) values associated with a subset of next states. Further, to exploit the latent low-rank structure of the dynamical system, Hamiltonian \(Q\)-Learning uses a matrix completion algorithm to reconstruct the updated \(Q\) function from \(Q\) value updates over a much smaller subset of state-action pairs. By providing an efficient way to apply \(Q\)-learning in stochastic, high-dimensional problems, the proposed approach broadens the scope of RL algorithms for real-world applications, including classical control tasks and environmental monitoring.
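To make the sampling step concrete, the following minimal sketch (in NumPy) illustrates how the expectation over next states in the \(Q\)-learning update, \(Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[r + \gamma\,\mathbb{E}_{s'}[\max_{a'} Q(s',a')]]\), can be approximated by an average over HMC samples. It is an illustration of the technique rather than our implementation; the transition log-density log_prob, the discretization helper nearest_state, and all hyperparameters are assumptions made for the example.

import numpy as np

def hmc_sample(log_prob, grad_log_prob, x0, n_samples,
               step_size=0.05, n_leapfrog=20, rng=None):
    """Draw approximate samples from the next-state density P(s' | s, a)
    with Hamiltonian Monte Carlo (leapfrog integration + Metropolis test)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(x.shape)                  # resample momentum
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * step_size * grad_log_prob(x_new)   # half momentum step
        for _ in range(n_leapfrog - 1):
            x_new += step_size * p_new                    # full position step
            p_new += step_size * grad_log_prob(x_new)     # full momentum step
        x_new += step_size * p_new
        p_new += 0.5 * step_size * grad_log_prob(x_new)   # final half step
        # Accept or reject based on the change in total energy
        h_old = -log_prob(x) + 0.5 * p @ p
        h_new = -log_prob(x_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
        samples.append(x.copy())
    return np.asarray(samples)

def hamiltonian_q_update(Q, s_idx, a_idx, reward, s0, log_prob, grad_log_prob,
                         nearest_state, alpha=0.1, gamma=0.9, n_samples=32):
    """One Q-value update in which E[max_a' Q(s', a')] is replaced by an
    average over HMC samples of the next state (hypothetical helper names)."""
    next_states = hmc_sample(log_prob, grad_log_prob, x0=s0, n_samples=n_samples)
    # Map each sampled continuous state to a discrete state index, then
    # average the greedy Q-values to estimate the expected future reward.
    future = np.mean([Q[nearest_state(x)].max() for x in next_states])
    Q[s_idx, a_idx] = (1 - alpha) * Q[s_idx, a_idx] + alpha * (reward + gamma * future)
    return Q

The Metropolis test keeps the sampler consistent with the target next-state distribution even when the leapfrog step size is coarse.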
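The matrix completion step can be sketched in the same spirit. The example below uses singular value thresholding (SVT), a standard low-rank matrix completion algorithm, to recover the full \(Q\) matrix from \(Q\)-value updates computed only on a sampled subset \(\Omega\) of state-action pairs; the threshold tau and step size are illustrative defaults, and the solver used in the paper may differ.

import numpy as np

def svt_complete(Q_obs, mask, tau=None, step=1.2, n_iters=200, tol=1e-4):
    """Reconstruct a low-rank Q matrix from updates observed on a subset
    of state-action pairs, via singular value thresholding (SVT).

    Q_obs : matrix holding the observed Q-value updates
    mask  : boolean array, True where an update was actually computed
    """
    m, n = Q_obs.shape
    tau = 5 * np.sqrt(m * n) if tau is None else tau
    Y = np.zeros_like(Q_obs, dtype=float)
    X = Y
    for _ in range(n_iters):
        # Shrink singular values: the proximal step for the nuclear norm
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s = np.maximum(s - tau, 0.0)
        X = (U * s) @ Vt
        # Correction step: push the reconstruction toward observed entries
        resid = mask * (Q_obs - X)
        Y = Y + step * resid
        if np.linalg.norm(resid) <= tol * np.linalg.norm(mask * Q_obs):
            break
    return X

Each iteration alternates a singular-value shrinkage step, which enforces low rank, with a correction step that keeps the reconstruction consistent with the observed \(Q\)-value updates.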
    
    
       
      
Results are provided for a cartpole system. Figures (a), (b), and (c) show policy heat maps for
\(Q\)-Learning with exhaustive sampling, Hamiltonian \(Q\)-Learning, and \(Q\)-Learning with IID sampling,
respectively. Figure (d) compares the convergence of the \(Q\) function for
Hamiltonian \(Q\)-Learning and \(Q\)-Learning with IID sampling.
      
     
    
    
      Application to adaptive ocean sampling
      
Ocean sampling plays a major role in a variety of science and engineering problems, ranging from modeling marine
ecosystems to predicting global climate. Here, we consider the problem of using an underwater glider to obtain
measurements of a scalar field (e.g., temperature, salinity, or the concentration of a certain zooplankton) and
illustrate how using Hamiltonian \(Q\)-Learning to plan the glider trajectory can yield measurements
that minimize the uncertainty associated with the field.
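As a hypothetical illustration of what "minimizing the uncertainty associated with the field" can mean as a reward signal, the sketch below models the scalar field as a Gaussian process with an RBF kernel and rewards a measurement by the reduction in total posterior variance it produces over a grid of locations. This formulation is a common choice in adaptive sampling, but it is an assumption made for the example, not necessarily the exact reward used in our experiments.

import numpy as np

def rbf_kernel(X1, X2, length=1.0, sigma=1.0):
    """Squared-exponential kernel between two sets of locations."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma**2 * np.exp(-0.5 * d2 / length**2)

def total_posterior_variance(grid, X_obs, noise=1e-2):
    """Sum of GP posterior variances over the grid, given measurement sites."""
    K_gg = rbf_kernel(grid, grid)
    if len(X_obs) == 0:
        return np.trace(K_gg)
    K_oo = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_go = rbf_kernel(grid, X_obs)
    return np.trace(K_gg - K_go @ np.linalg.solve(K_oo, K_go.T))

def uncertainty_reward(grid, visited, candidate, noise=1e-2):
    """Hypothetical reward: drop in total posterior variance when the
    glider adds a measurement at `candidate` to those already `visited`."""
    visited = np.asarray(visited, dtype=float).reshape(-1, grid.shape[1])
    before = total_posterior_variance(grid, visited, noise)
    new_sites = np.vstack([visited, np.asarray(candidate, dtype=float)[None, :]])
    after = total_posterior_variance(grid, new_sites, noise)
    return before - after

A \(Q\)-learning agent over a gridded ocean region can then use this quantity as the per-step reward when choosing the glider's next waypoint.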
        
           
          
Results are provided for the adaptive ocean sampling task. Figures (a), (b), and (c) show policy heat maps for
\(Q\)-Learning with exhaustive sampling, Hamiltonian \(Q\)-Learning, and \(Q\)-Learning with IID sampling,
respectively. Figure (d) compares the convergence of the \(Q\) function for
Hamiltonian \(Q\)-Learning and \(Q\)-Learning with IID sampling.
          
         
      
     
    
    
BibTeX
      
        
@article{madhushani2020QLearning,
  title={Hamiltonian Q-Learning: Leveraging Importance-sampling for Data Efficient RL},
  author={Madhushani, Udari and Dey, Biswadip and Leonard, Naomi Ehrich and Chakraborty, Amit},
  note={Under review},
  year={2020}
}