
Urban Driver: Learning to Drive from Real-world Demonstrations Using Policy Gradients

  • Achieves state-of-the-art results in urban driving scenarios
  • Data: 100 hours of expert demonstrations collected on urban roads
  • No need to add complex state perturbations
  • No need to collect additional on-policy data during training

Introduction

Overview of the closed-loop training algorithm proposed in the paper
  • Reference for a state-of-the-art industrial trajectory planner:
    • H. Fan, F. Zhu, C. Liu, L. Zhang, L. Zhuang, D. Li, W. Zhu, J. Hu, H. Li, and Q. Kong. Baidu apollo em motion planner. ArXiv, 2018.
  • Main contributions of this paper:
    • First to demonstrate, in complex urban driving scenarios, that a driving policy can be learned by imitation from large amounts of real-world demonstration data using policy gradients
    • A new differentiable simulator that performs closed-loop simulation from logged data and computes policy gradients by backpropagation through time, enabling fast learning
    • A policy trained purely in the simulator can control a self-driving vehicle in the real world, outperforming other methods
    • Source code: https://planning.l5kit.org
  • Trajectory-based optimization

    • Currently a dominant approach in industry

    • Relies on hand-defined losses and rewards

    • The cost can be optimized with a range of classical algorithms:

      • A* [11]
      • RRTs [12]
      • POMDP with solver [13]
      • dynamic programming [14]
    • Overall it relies on human engineering rather than being data-driven (a toy cost-based planning sketch is given at the end of this list)

  • Reinforcement learning (RL)

    • Relies on constructing a simulator and on a precisely specified and well-tuned reward signal
      • S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. ArXiv, 2016.
    • Hand-coded simulators cannot reproduce real-world long-tail scenarios
    • This paper instead builds the simulation environment directly from real-world logs via mid-level representations
  • Imitation learning (IL) and Inverse Reinforcement Learning (IRL)

    • Naive behavioral cloning suffers from the covariate shift problem
    • Adversarial imitation learning [31, 32, 33] has not yet been applied to autonomous driving
  • Neural Motion Planners

    • In [34], raw sensor input and HD maps are used to estimate a cost volume over possible future SDV positions; trajectories are then sampled and the lowest-cost one is selected for execution. These methods have not yet been tested on a real vehicle.
      • W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun. End-to-end interpretable neural motion planner. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
      • S. Casas, A. Sadat, and R. Urtasun. Mp3: A unified model to map, perceive, predict and plan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14403–14412, 2021.
  • Mid-representations and the availability of large-scale real-world AD datasets

    • J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska. One thousand and one hours: Self-driving motion prediction dataset. Conference on Robot Learning (CoRL), 2020.
    • M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, and J. Hays. Argoverse: 3D tracking and forecasting with rich maps. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
    • state-of-the-art solutions for motion forecasting [8, 9]
      • [8] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.
      • [9] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun. Learning lane graph representations for motion forecasting. 2020.
  • Data-driven simulation

    • [23] created a photo-realistic simulator for training an end-to-end RL policy.
    • [5] simulated a bird’s-eye view of dense traffic on a highway.
    • Finally, two recent works [39, 40] developed data-driven simulators and showed their usefulness for training and validating ML planners.
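
To make the contrast with the learned approach concrete, below is a toy sketch of the kind of hand-engineered, cost-based trajectory selection that the trajectory-optimization planners above rely on. All function names, cost terms, and weights are invented for illustration and are not taken from any of the cited systems.

```python
import numpy as np

def plan_by_cost(candidates, obstacles, reference):
    """Pick the candidate trajectory with the lowest hand-crafted cost.

    candidates: (K, T, 2) array of sampled candidate trajectories
    obstacles:  (M, 2) array of obstacle positions
    reference:  (T, 2) reference path (e.g. a lane centerline)
    """
    costs = []
    for traj in candidates:
        # Hand-tuned cost terms: track the reference, stay away from
        # obstacles, and keep the trajectory smooth.
        ref_cost = np.linalg.norm(traj - reference, axis=1).mean()
        dists = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=2)
        obstacle_cost = np.exp(-dists.min(axis=1)).mean()
        smooth_cost = np.linalg.norm(np.diff(traj, n=2, axis=0), axis=1).mean()
        costs.append(1.0 * ref_cost + 5.0 * obstacle_cost + 2.0 * smooth_cost)
    # Execute the lowest-cost trajectory.
    return candidates[int(np.argmin(costs))]
```

The weights on each cost term are exactly the kind of human engineering that the paper aims to replace with learning from demonstrations.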

Differentiable Traffic Simulator from Real-world Driving Data

  • The simulator is built from real-world logged experience trajectories
  • The goal of simulation is to iteratively generate the sequence of observed states and roll out the resulting vehicle trajectory
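
A minimal PyTorch-style sketch of such a closed-loop, differentiable rollout is given below, assuming a simplified state layout. The names (`rollout`, `policy`, `logged_agents`) and the kinematic update are illustrative only and differ from the actual l5kit implementation; the key idea is that every step stays in differentiable tensor operations while non-ego agents are replayed from the log.

```python
import torch

def rollout(policy, sdv_init, logged_agents, horizon):
    """Closed-loop unroll of a policy inside a log-replay simulator.

    sdv_init:      (3,) tensor [x, y, yaw] of the ego vehicle (SDV)
    logged_agents: (horizon, N, 3) tensor of logged agent poses
    Returns the simulated SDV trajectory as a (horizon, 3) tensor.
    """
    sdv_state = sdv_init
    trajectory = []
    for t in range(horizon):
        # Observation built from the current SDV state and the logged
        # (non-reactive) agents at step t.
        obs = torch.cat([sdv_state, logged_agents[t].reshape(-1)])
        # The policy predicts a local motion increment (dx, dy, dyaw).
        delta = policy(obs)
        # Kinematic update kept in torch ops so gradients can flow
        # through the entire rollout (backpropagation through time).
        cos_y, sin_y = torch.cos(sdv_state[2]), torch.sin(sdv_state[2])
        sdv_state = torch.stack([
            sdv_state[0] + delta[0] * cos_y - delta[1] * sin_y,
            sdv_state[1] + delta[0] * sin_y + delta[1] * cos_y,
            sdv_state[2] + delta[2],
        ])
        trajectory.append(sdv_state)
    return torch.stack(trajectory)
```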

Imitation Learning Using a Differentiable Simulator

  • Given the expert policy and the model's learned policy, the goal is to make the two policies as close as possible
Imitation learning from expert demonstrations

P.S.: Since the beginning of every rollout comes from expert states, a bias would be introduced; during policy updates, gradients are therefore only computed after the first K steps of the rollout, which avoids this bias.

  • Policy-gradient computation (subscripts denote partial derivatives; θ are the policy parameters):
    • Ref: N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
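
To make the gradient computation concrete, here is a hedged sketch of one training step built on the `rollout` function from the previous sketch: the simulated trajectory is compared against the expert positions from the log, the first K steps are excluded from the loss (the bias-avoidance trick noted above), and `loss.backward()` performs the backpropagation through time that yields the policy gradient. The simple L1 loss and all names are illustrative, not the paper's exact formulation.

```python
def training_step(policy, optimizer, sdv_init, logged_agents,
                  expert_traj, horizon, skip_first_k=10):
    """One imitation-learning update through the differentiable simulator.

    expert_traj: (horizon, 3) tensor of expert [x, y, yaw] from the log.
    """
    sim_traj = rollout(policy, sdv_init, logged_agents, horizon)
    # Skip the first K steps: they start from expert states and would
    # otherwise bias the gradient estimate.
    error = sim_traj[skip_first_k:] - expert_traj[skip_first_k:]
    loss = error.abs().mean()   # simple L1 imitation loss (illustrative)
    optimizer.zero_grad()
    loss.backward()             # backpropagation through time over the rollout
    optimizer.step()
    return loss.item()
```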

Experiments

  • Lyft Motion Prediction Dataset [6]: data collected along complex urban routes in Palo Alto, California. The dataset captures a wide variety of real-world situations, such as driving in multi-lane traffic, turning, and interacting with vehicles at intersections.

    • J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska. One thousand and one hours: Self-driving motion prediction dataset. Conference on Robot Learning (CoRL), 2020.
  • The model is trained on a 100-hour subset and tested on a 25-hour subset.

  • three state-of-the-art baselines:

    • Naive Behavioral Cloning (BC)
    • Behavioral Cloning + Perturbations (BC-perturb)
      • M. Bansal, A. Krizhevsky, and A. Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. ArXiv, 2018.
    • Multi-step Prediction (MS Prediction)
      • A. Venkatraman, M. Hebert, and J. Bagnell. Improving multi-step prediction of learned time series models. In AAAI, 2015.
Performance comparison

Lower values are better for all metrics; the proposed model achieves the best overall performance and the lowest I1K (interventions per 1000 miles, which aggregates the safety-critical failure metrics).

  • Evaluation metrics:
    • L2: L2 distance to the underlying expert position in the driving log in meters.
    • Off-road events: we report a failure if the planner deviates more than 2m laterally from the reference trajectory – this captures events such as running off-road and into opposing traffic.
    • Collisions: collisions of the SDV with any other agent, broken down into front, side and rear collisions w.r.t. the SDV.
    • Comfort: we monitor the absolute value of acceleration and raise a failure should this exceed 3 m/s².
    • I1K: we accumulate safety-critical failures (collisions and off-road events) into one key metric for ease of comparison, namely Interventions per 1000 Miles (I1K).
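
For reference, a rough NumPy sketch of how these metrics could be computed from a simulated trajectory and the expert log. The 2 m and 3 m/s² thresholds follow the definitions above; the helper name, the straight-line mileage estimate, and the use of plain positional error in place of true lateral deviation are simplifications, and collisions are omitted.

```python
import numpy as np

def evaluate(sim_xy, expert_xy, dt=0.1):
    """Compute L2, off-road, comfort and an I1K-style rate.

    sim_xy, expert_xy: (T, 2) arrays of positions in meters.
    """
    # L2 distance to the expert positions from the driving log.
    l2 = np.linalg.norm(sim_xy - expert_xy, axis=1)

    # Off-road event: deviation from the reference trajectory above 2 m
    # (true lateral deviation approximated by plain positional error here).
    off_road = bool((l2 > 2.0).any())

    # Comfort: absolute acceleration must stay below 3 m/s^2.
    vel = np.diff(sim_xy, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    uncomfortable = bool((np.linalg.norm(acc, axis=1) > 3.0).any())

    # I1K: safety-critical failures per 1000 miles (collisions omitted).
    miles = np.linalg.norm(np.diff(expert_xy, axis=0), axis=1).sum() / 1609.34
    i1k = (1.0 if off_road else 0.0) / max(miles, 1e-6) * 1000.0

    return {"L2_mean": float(l2.mean()), "off_road": off_road,
            "uncomfortable": uncomfortable, "I1K": i1k}
```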
Simulation results

Summary

The derivation of the policy gradient is worth revisiting in more detail. The paper includes both simulation and real-vehicle experiments, but in the method comparison the other algorithms were modified, so the comparison is not entirely complete.