Model-based offline planning

  • Because of cost, safety, and similar constraints, it is often impossible to learn a control policy by interacting with the system directly, so the policy must be learned from recorded log data (offline reinforcement learning). This post introduces a method, named model-based offline planning (MBOP), for learning from log data a new policy that outperforms the original policy that generated the logs.

Introduction

  • Offline reinforcement learning methods include:
    • model-free methods:
      • Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. CoRR, abs/1911.11361, 2019. URL http://arxiv.org/abs/1911.11361.
      • Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
      • Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062, 2019.
      • Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 33, 2020.
    • model-based methods: MOPO and MOReL learn a model and then use it to train a model-free policy, a pattern similar to Dyna.
      • MOPO: Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.
      • MOReL: Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
      • Dyna: Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • The algorithm in this paper is model-based: it builds on model-predictive control (MPC), extends the MPPI trajectory optimizer, and uses real-time planning to produce policies that reach goals or satisfy reward conditions.
    • Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721. IEEE, 2017b.
  • The MBOP model consists of three learned components (a rough training sketch, under assumed network shapes, follows this list):
    • a learnt world model,
    • a learnt behavior-cloning policy,
    • a learnt fixed-horizon value-function.
  • MBOP's core advantage is data-efficient adaptation: from as little as 100 seconds of logged data it can produce a policy adapted to a given reward function, goal state, or state-based constraint.
  • MBOP can adapt zero-shot to non-stationary goals and constraints, but it has no mechanism for handling non-stationary dynamics.
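
To make the three components concrete, here is a minimal training sketch under assumed settings: each component is a small MLP fit by plain supervised regression on logged transitions. The network sizes, state/action dimensions, and the random stand-in batch are my own assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code) of MBOP's three learned components,
# trained offline by supervised regression on logged transitions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(inp, out, hidden=256):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out))

S, A = 17, 6                    # assumed state / action dimensions

f_m = mlp(S + A, S + 1)         # dynamics + reward: (s_t, a_t) -> (s_{t+1}, r_t)
f_b = mlp(S, A)                 # behavior-cloning prior: s_t -> a_t
f_R = mlp(S + A, 1)             # fixed-horizon value: (s_t, a_t) -> H-step return

opt = torch.optim.Adam(
    [*f_m.parameters(), *f_b.parameters(), *f_R.parameters()], lr=1e-3)

# Random batch standing in for the offline log (s, a, r, s', H-step return).
s, a = torch.randn(64, S), torch.randn(64, A)
s_next, r = torch.randn(64, S), torch.randn(64, 1)
ret_H = torch.randn(64, 1)

loss = (F.mse_loss(f_m(torch.cat([s, a], -1)), torch.cat([s_next, r], -1))
        + F.mse_loss(f_b(s), a)
        + F.mse_loss(f_R(torch.cat([s, a], -1)), ret_H))
opt.zero_grad(); loss.backward(); opt.step()
```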

Model-based offline planning

  • The problem is formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where
    • $\mathcal{S}$ is the set of system states
    • $\mathcal{A}$ is the set of actions
    • $P(s_{t+1} \mid s_t, a_t)$ is the state-transition probability
    • $r(s_t, a_t)$ is the reward
    • $\gamma$ is the discount factor
  • MBOP uses three function approximators:
    • $f_m$: a single-step model of the environment dynamics, $f_m(s_t, a_t) \rightarrow (s_{t+1}, r_t)$, with one head for the state prediction and one for the reward prediction.
    • $f_b$: a behavior-cloning network that predicts the logged action from the current state; the planner uses it as a prior to guide trajectory sampling.
    • $f_R$: a truncated, fixed-horizon value function that estimates the return accumulated over a fixed horizon after taking action $a$ in state $s$.
  • MBOP-POLICY
    • Uses MPC to output an action at each new state. At every time step, MPC runs a fixed-length planning pass and returns a trajectory $T$ of length $H$; the first action of that trajectory is selected and returned (a minimal loop sketch follows the algorithm caption below).
(Algorithm: High-Level MBOP-Policy)
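
The loop below is a minimal sketch of this MPC-style policy, assuming an `mbop_trajopt` planner like the one described next and a standard `env.reset`/`env.step` interface; both names are illustrative assumptions.

```python
# Sketch of the high-level MBOP-Policy loop: re-plan at every step and execute
# only the first action of the returned length-H plan.
def mbop_policy(env, mbop_trajopt, episode_len):
    s = env.reset()
    prev_traj = None                  # planner is assumed to fall back to a zero plan when None
    total_reward = 0.0
    for _ in range(episode_len):
        traj = mbop_trajopt(s, prev_traj)      # length-H action sequence
        a = traj[0]                            # MPC: apply only the first action
        s, r, done, _ = env.step(a)
        total_reward += r
        prev_traj = traj                       # becomes the sampling anchor at the next step
        if done:
            break
    return total_reward
```
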
  • MBOP-TRAJOPT
    • Builds on PDDM by adding a policy prior and a value prediction.
(Algorithm: MBOP-Trajopt)

P.S.: Line 11 of MBOP-Trajopt averages the action from the previously sampled trajectory into the action proposed by the prior. The likely intent is that, while the networks have not yet converged, the executed actions should not drift too far from the logged behavior and should stay close to the sampled trajectories; the mixing parameter can be viewed as a learning rate. Line 17 produces the output by re-weighting the candidate trajectories toward the ones with the highest return (re-weighting). A sketch of both steps follows.
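
Below is a hedged sketch of how I read MBOP-Trajopt: candidate trajectories are rolled out through the learned model with actions sampled around the BC prior, mixed with the previous plan via a coefficient `beta` ("line 11"), scored with the accumulated model reward plus the fixed-horizon value at the end, and finally averaged with exponential weights on the returns ("line 17"). The function signatures, the exact placement of `beta`/`kappa`, and the default values are my assumptions, not the paper's pseudocode.

```python
import numpy as np

def mbop_trajopt_sketch(s0, prev_traj, f_m, f_b, f_R,
                        n_traj=100, sigma=0.1, beta=0.5, kappa=3.0):
    """Hedged sketch of a PDDM-style trajectory optimizer with a BC prior and value tail."""
    horizon, act_dim = prev_traj.shape
    trajs = np.zeros((n_traj, horizon, act_dim))
    returns = np.zeros(n_traj)
    for i in range(n_traj):
        s, ret = s0, 0.0
        for h in range(horizon):
            noise = sigma * np.random.randn(act_dim)
            # "Line 11" as read above: average the noisy prior action with the action
            # from the previous plan; beta acts like a step size keeping new samples
            # close to the running trajectory.
            a = (1 - beta) * prev_traj[min(h + 1, horizon - 1)] + beta * (f_b(s) + noise)
            s, r = f_m(s, a)          # learned one-step model: next state and reward
            ret += r
            trajs[i, h] = a
        ret += f_R(s, a)              # fixed-horizon value closes the truncated rollout
        returns[i] = ret
    # "Line 17": exponential re-weighting by return, so the averaged plan is dominated
    # by (but not identical to) the highest-return samples.
    w = np.exp(kappa * (returns - returns.max()))
    return (w[:, None, None] * trajs).sum(axis=0) / w.sum()
```

With dummy models (e.g. `f_b = lambda s: np.zeros(2)`), the returned length-H plan can be fed back in as `prev_traj` at the next control step, which is how the policy loop above uses it.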

Experimental results

  • First, train on a very small amount of data; then transfer to two kinds of novel tasks that share the same system dynamics:

    • goal-conditioned tasks (that ignore the original reward function)
    • constrained tasks (that require optimising for the original reward under some state constraint)
  • Datasets used: RL Unplugged (RLU) and D4RL

    • RL Unplugged (RLU): Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gomez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, et al. RL Unplugged: Benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888, 2020.
      • cartpole-swingup
      • walker
      • quadruped
    • D4RL: Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
      • halfcheetah
      • hopper
      • walker2d
      • Adroit
  • For the Quadruped and Walker tasks from RLU, episode quality in the dataset varies widely, so low-performing episodes are discarded via a performance threshold when training the behavior-cloning prior and the value function; the dynamics model is trained on the unfiltered data.

  • For every dataset, 90% of the data is used for training and 10% for validation.

  • Performance: For the RLU datasets (Fig. 1), we observe that MBOP is able to find a near-optimal policy on most dataset sizes in Cartpole and Quadruped with as little as 5000 steps, which corresponds to 5 episodes, or approximately 50 seconds on Cartpole and 100 seconds on Quadruped. On the Walker datasets MBOP requires 23 episodes (approx. 10 minutes) before it finds a reasonable policy, and with sufficient data converges to a score of 900, which is near optimal. On most tasks, MBOP is able to generate a policy significantly better than the behavior data as well as the BC prior.

  • MBOP adapts easily to new objective functions: a user-defined objective over states can be added as an extra sub-goal term, and only the trajectory update (re-weighting) rule needs to change to account for it (see the hedged sketch after this list).

  • To verify this adaptability, two experiments were run:

    • goal-conditioned control (the original reward is ignored and a new, goal-based reward is used)
    • constrained control (a state-based constraint is added, and the planner then searches for a policy that optimises the original reward while satisfying it)
(Paper section: Zero-Shot Task Adaptation)
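
Under my reading of the adaptation described above, it can be implemented by augmenting the per-trajectory return that drives the planner's re-weighting with a user-defined state objective, without retraining any of the learned models. The names `f_goal`, `alpha`, and the simple additive form below are illustrative assumptions, not the paper's exact update rule.

```python
# Hedged sketch: fold a user-defined goal/cost term over the visited states into
# the return used for trajectory re-weighting; the learned models stay untouched.
def adapted_return(rewards, states, f_goal, alpha=1.0, keep_env_reward=True):
    ret = sum(rewards) if keep_env_reward else 0.0   # goal-conditioned mode drops the original reward
    ret += alpha * sum(f_goal(s) for s in states)    # e.g. -distance to goal, or a constraint penalty
    return ret
```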

Summary

MBOP provides an easy-to-implement, data-efficient, stable, and flexible algorithm for policy generation.

Because it plans online, it can cope with changing goals, costs, and environment constraints.

However, the algorithm has not been tested in more complex scenarios or under more complex constraints, so its range of applicability and effectiveness remain to be verified.