
Autonomous Driving | VAD: Vectorized Scene Representation for Efficient Autonomous Driving

  • This paper, published by Horizon Robotics, abandons rasterized scene representations in favor of vectorized ones and proposes an end-to-end model that achieves SOTA results on nuScenes
  • code:https://github.com/hustvl/VAD

1. Introduction

  • Comparison of the two ways of representing the scene (rasterized vs. vectorized)
  • Some end-to-end models learn directly from raw sensor data and output planning results without any explicit scene representation, which hurts interpretability;
  • Rasterized representations are simple, but they lose instance-level information and are computationally expensive;
  • VAD achieves SOTA performance, lowers the collision rate, and runs faster;
  • Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, 2020

Perception

  • Models built on BEV representations:
    • LSS: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, 2020.
    • BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022
    • MapTR: Structured modeling and learning for online vectorized HD map construction. arXiv preprint arXiv:2208.14437, 2022

Motion Prediction

  • ViP3D: End-to-end visual trajectory prediction via 3D agent queries. arXiv preprint arXiv:2208.01582, 2022

  • FIERY: Future instance prediction in bird's-eye view from surround monocular cameras. In ICCV, 2021

  • Perceive, interact, predict: Learning dynamic and static clues for end-to-end motion prediction. arXiv preprint arXiv:2212.02181, 2022.

  • Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022.

Planning

  • Skip perception and motion prediction and directly predict the planned trajectory and control signals: direct and simple, but lacking interpretability and hard to optimize
    • Exploring the limitations of behavior cloning for autonomous driving. In ICCV, 2019
    • Multimodal fusion transformer for end-to-end autonomous driving. In CVPR, 2021
  • Reinforcement learning methods
    • Gri: General reinforced imitation and its application to vision-based autonomous driving. arXiv preprint arXiv:2111.08575, 2021.
    • Learning to drive from a world on rails. In ICCV, 2021
    • End-to-end model-free reinforcement learning for urban driving using implicit affordances. In CVPR, 2020.
  • Dense cost map methods
    • Mp3: A unified model to map, perceive, predict and plan. In CVPR, 2021.
    • Lookout: Diverse multi-future prediction and planning for self-driving. In ICCV, 2021
  • Plant: Explainable planning transformers via object-level representations. arXiv preprint arXiv:2210.14222, 2022.

3. Method

  • VAD obtains the vectorized representation of each scene element by querying BEV features

3.1. Vectorized Scene Learning

Vectorized Map

  • VAD uses map queries to extract map information from the BEV features and predicts map vectors; the output contains a fixed number of map vectors, each consisting of a fixed number of points (see the sketch after this list);
  • Only three types of map elements are annotated:
    • lane divider
    • road boundary
    • pedestrian crossing
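
As a reading aid, here is a minimal sketch (in PyTorch, not the authors' released code) of how map queries might decode vectorized map elements from BEV features; the query count `N_m = 100`, points per vector `N_p = 20`, the BEV size, and the class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VectorizedMapHead(nn.Module):
    """Toy map decoder: map queries cross-attend to BEV features and are
    decoded into polylines (N_p points each) plus a class score over the
    three map element types (lane divider / road boundary / pedestrian crossing)."""

    def __init__(self, embed_dim=256, num_map_queries=100, num_points=20, num_classes=3):
        super().__init__()
        self.map_queries = nn.Embedding(num_map_queries, embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.point_head = nn.Linear(embed_dim, num_points * 2)  # (x, y) per polyline point
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.num_points = num_points

    def forward(self, bev_features):
        # bev_features: (B, H*W, C) flattened BEV feature map
        B = bev_features.shape[0]
        q = self.map_queries.weight.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.cross_attn(q, bev_features, bev_features)
        map_vectors = self.point_head(q).view(B, -1, self.num_points, 2)  # (B, N_m, N_p, 2)
        map_scores = self.cls_head(q)                                     # (B, N_m, 3)
        return map_vectors, map_scores

bev = torch.randn(1, 100 * 100, 256)        # a 100x100 BEV grid with 256-dim features
vectors, scores = VectorizedMapHead()(bev)
print(vectors.shape, scores.shape)          # (1, 100, 20, 2), (1, 100, 3)
```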

Vectorized Agent Motion

  • A group of agent queries extracts agent-level features from the BEV features and predicts each agent's attributes together with multi-modal future trajectories (sketched below)

  • Overall architecture of VAD
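
Following the same assumptions, a hypothetical sketch of the agent motion branch: each agent query is decoded into a current position plus K multi-modal future trajectories with mode scores (all shapes illustrative):

```python
import torch
import torch.nn as nn

class VectorizedMotionHead(nn.Module):
    """Toy agent branch: agent queries cross-attend to BEV features and are
    decoded into a detected (x, y) position plus K candidate future
    trajectories (T_f waypoints each) with per-mode scores."""

    def __init__(self, embed_dim=256, num_agent_queries=300, num_modes=6, future_steps=6):
        super().__init__()
        self.agent_queries = nn.Embedding(num_agent_queries, embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.pos_head = nn.Linear(embed_dim, 2)
        self.traj_head = nn.Linear(embed_dim, num_modes * future_steps * 2)
        self.score_head = nn.Linear(embed_dim, num_modes)
        self.num_modes, self.future_steps = num_modes, future_steps

    def forward(self, bev_features):
        B = bev_features.shape[0]
        q = self.agent_queries.weight.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.cross_attn(q, bev_features, bev_features)
        positions = self.pos_head(q)                                                 # (B, N_a, 2)
        trajs = self.traj_head(q).view(B, -1, self.num_modes, self.future_steps, 2)  # (B, N_a, K, T_f, 2)
        mode_scores = self.score_head(q).softmax(dim=-1)                             # (B, N_a, K)
        return positions, trajs, mode_scores, q   # updated agent queries are reused for planning

bev = torch.randn(1, 100 * 100, 256)
positions, trajs, mode_scores, agent_queries = VectorizedMotionHead()(bev)
print(trajs.shape)   # (1, 300, 6, 6, 2)
```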

3.2. Planning via Interaction

Ego-Agent Interaction

  • A randomly initialized ego query interacts with the agent queries in a transformer decoder (see the sketch after the list below)
  • Notation:
    • ego position
    • agent positions
    • a single layer MLP
    • query position embedding
    • key position embedding
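
A sketch of this ego-agent interaction using the notation above: the ego/agent positions go through a single-layer MLP to become query/key position embeddings, which are added before a standard cross-attention layer (the ego-map interaction in the next block follows the same pattern with map queries as keys). All names and sizes here are assumptions for illustration:

```python
import torch
import torch.nn as nn

embed_dim, num_agents = 256, 300

ego_query = torch.randn(1, 1, embed_dim)            # randomly initialized ego query
agent_queries = torch.randn(1, num_agents, embed_dim)

ego_pos = torch.zeros(1, 1, 2)                      # ego position (origin of the ego frame)
agent_pos = torch.randn(1, num_agents, 2)           # agent positions in BEV

pos_mlp = nn.Linear(2, embed_dim)                   # single-layer MLP for position embeddings
q_pos_embed = pos_mlp(ego_pos)                      # query position embedding
k_pos_embed = pos_mlp(agent_pos)                    # key position embedding

# one transformer-decoder cross-attention layer: the ego query attends to
# the agent queries, with positional information injected additively
attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
updated_ego_query, _ = attn(
    query=ego_query + q_pos_embed,
    key=agent_queries + k_pos_embed,
    value=agent_queries,
)
print(updated_ego_query.shape)   # (1, 1, 256)
```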

Ego-Map Interaction

  • The updated ego query then interacts with the map queries in a transformer decoder, analogous to the ego-agent interaction

Planning Head

  • driving commands

    • turn left
    • turn right
    • go straight
  • planning trajectory

  • the current status of the ego vehicle (optional) as ego features
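
A hypothetical sketch of the planning head consistent with the bullets above: the updated ego query, an embedding of the driving command (turn left / turn right / go straight), and optionally ego status features are concatenated and regressed by an MLP into future waypoints; the layer sizes and the status features are assumptions:

```python
import torch
import torch.nn as nn

class PlanningHead(nn.Module):
    """Toy planning head: (ego query, driving command, optional ego status) -> T_f waypoints."""

    COMMANDS = {"turn_left": 0, "turn_right": 1, "go_straight": 2}

    def __init__(self, embed_dim=256, status_dim=4, future_steps=6):
        super().__init__()
        self.cmd_embed = nn.Embedding(len(self.COMMANDS), embed_dim)
        self.status_proj = nn.Linear(status_dim, embed_dim)   # e.g. speed, yaw rate, ... (assumed)
        self.mlp = nn.Sequential(
            nn.Linear(3 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, future_steps * 2),
        )
        self.future_steps = future_steps

    def forward(self, ego_query, command, ego_status):
        cmd = self.cmd_embed(command)                           # (B, embed_dim)
        status = self.status_proj(ego_status)                   # (B, embed_dim)
        feat = torch.cat([ego_query, cmd, status], dim=-1)
        return self.mlp(feat).view(-1, self.future_steps, 2)    # planned (x, y) waypoints

head = PlanningHead()
plan = head(
    ego_query=torch.randn(1, 256),
    command=torch.tensor([PlanningHead.COMMANDS["go_straight"]]),
    ego_status=torch.randn(1, 4),
)
print(plan.shape)   # (1, 6, 2)
```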

  • Illustration of Vectorized Planning Constraints

Ego-Agent Collision Constraint

  • Specifically, we first filter out low-confidence agent predictions by a threshold

  • The trajectory with the highest probability among the multi-modal predictions is used as the agent's final predicted trajectory

For each future timestep, and for both the lateral and the longitudinal direction, the distance from the planned ego waypoint to the nearest agent is compared against a distance threshold, and the collision loss penalizes distances that fall below the threshold.
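
A minimal sketch of such a collision term (not the paper's exact formula): keep only high-confidence agents, take each agent's highest-probability mode, and for every future timestep penalize lateral/longitudinal distances to the nearest agent that fall below a threshold. The thresholds and the ego-frame axis decomposition are illustrative:

```python
import torch

def collision_loss(ego_plan, agent_trajs, agent_scores, conf_thresh=0.5, d_thresh=(0.5, 3.0)):
    """Toy ego-agent collision loss.
    ego_plan:     (T, 2) planned ego waypoints, x = lateral / y = longitudinal in ego frame
    agent_trajs:  (N, K, T, 2) multi-modal agent trajectory predictions
    agent_scores: (N, K) mode probabilities (their max also serves as agent confidence)
    """
    mode_probs, best_mode = agent_scores.max(dim=1)        # highest-probability mode per agent
    keep = mode_probs > conf_thresh                         # filter out low-confidence agents
    if keep.sum() == 0:
        return ego_plan.new_zeros(())
    idx = keep.nonzero(as_tuple=True)[0]
    trajs = agent_trajs[idx, best_mode[idx]]                # (N_kept, T, 2)

    loss = ego_plan.new_zeros(())
    for axis, thr in enumerate(d_thresh):                   # 0: lateral, 1: longitudinal
        dist = (trajs[..., axis] - ego_plan[None, :, axis]).abs()    # (N_kept, T)
        nearest = dist.min(dim=0).values                    # distance to the nearest agent per step
        loss = loss + torch.clamp(thr - nearest, min=0).mean()       # penalize only when too close
    return loss

ego_plan = torch.randn(6, 2)
agent_trajs = torch.randn(10, 6, 6, 2)
agent_scores = torch.rand(10, 6).softmax(dim=-1)
print(collision_loss(ego_plan, agent_trajs, agent_scores, conf_thresh=0.1))
```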

Ego-Boundary Overstepping Constraint

  • This loss keeps the ego vehicle within the drivable area

The loss compares the distance from the t-th planned waypoint to the nearest road boundary against a map boundary threshold and penalizes waypoints that come too close.
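
A sketch of this boundary term under the same assumptions: the distance from each planned waypoint to the nearest point sampled from the predicted boundary vectors is compared against a threshold, and getting too close is penalized:

```python
import torch

def boundary_loss(ego_plan, boundary_points, bd_thresh=1.0):
    """Toy ego-boundary overstepping loss.
    ego_plan:        (T, 2) planned ego waypoints
    boundary_points: (M, 2) points sampled from the predicted road-boundary vectors
    bd_thresh:       map boundary threshold in metres (illustrative value)
    """
    dist = torch.cdist(ego_plan, boundary_points)   # pairwise waypoint-to-boundary distances: (T, M)
    nearest = dist.min(dim=1).values                # distance of the t-th waypoint to the closest boundary
    return torch.clamp(bd_thresh - nearest, min=0).mean()

ego_plan = torch.randn(6, 2)
boundary_points = torch.randn(200, 2) * 5
print(boundary_loss(ego_plan, boundary_points))
```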

Ego-Lane Directional Constraint

  • First, low-confidence map predictions are filtered out by a confidence threshold

  • Then we find the closest lane divider vector

  • Finally, the loss for this constraint is the angular difference between the matched lane vector and the planned ego motion vector, averaged over time
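
A sketch of this directional term: filter low-confidence map predictions, match each planned motion step to the closest lane-divider segment, and average the angle between the matched lane vector and the planned ego vector. The atan2-based angle and the midpoint matching are one possible implementation, not necessarily the paper's:

```python
import torch

def direction_loss(ego_plan, lane_vectors, lane_scores, conf_thresh=0.5):
    """Toy ego-lane directional loss.
    ego_plan:     (T, 2) planned ego waypoints
    lane_vectors: (L, 2, 2) lane-divider segments given as (start, end) points
    lane_scores:  (L,) lane-divider confidences
    """
    lanes = lane_vectors[lane_scores > conf_thresh]       # filter low-confidence map predictions
    if lanes.numel() == 0:
        return ego_plan.new_zeros(())

    ego_vecs = ego_plan[1:] - ego_plan[:-1]               # planned ego motion vector per step: (T-1, 2)
    lane_dirs = lanes[:, 1] - lanes[:, 0]                  # lane direction vectors: (L', 2)
    lane_mid = lanes.mean(dim=1)                           # segment midpoints, used for matching

    closest = torch.cdist(ego_plan[:-1], lane_mid).argmin(dim=1)   # closest lane segment per step
    matched = lane_dirs[closest]                                    # (T-1, 2)

    def angle(v):
        return torch.atan2(v[:, 1], v[:, 0])

    diff = angle(matched) - angle(ego_vecs)
    diff = torch.atan2(torch.sin(diff), torch.cos(diff)).abs()      # wrap angle difference to [0, pi]
    return diff.mean()                                               # averaged over time

ego_plan = torch.cumsum(torch.rand(6, 2), dim=0)
lane_vectors = torch.randn(20, 2, 2)
lane_scores = torch.rand(20)
print(direction_loss(ego_plan, lane_vectors, lane_scores))
```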

3.4. End-to-End Learning

Vectorized Scene Learning Loss

  • The map point regression loss is the Manhattan distance between the predicted map points and the ground-truth map points

  • A focal loss is used for map element classification; the total map loss combines the regression and classification terms

  • For motion prediction, an l1 loss is used for agent attribute regression and a focal loss for agent classification. For the predicted multi-modal trajectories, the mode with the minimum final displacement error (minFDE) with respect to the ground truth is taken as the representative prediction; an l1 loss then measures the regression error between this trajectory and the GT trajectory, and another focal loss supervises multi-modal motion classification. Together these terms form the motion prediction loss
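
A small sketch of the minFDE mode selection plus l1 regression described above (the focal-loss classification terms are omitted):

```python
import torch
import torch.nn.functional as F

def motion_regression_loss(pred_trajs, gt_traj):
    """pred_trajs: (K, T, 2) multi-modal predictions for one agent; gt_traj: (T, 2)."""
    fde = (pred_trajs[:, -1] - gt_traj[-1]).norm(dim=-1)   # final displacement error of each mode
    best = fde.argmin()                                     # the minFDE mode is the representative one
    return F.l1_loss(pred_trajs[best], gt_traj)             # l1 regression against the GT trajectory

pred_trajs = torch.randn(6, 6, 2)
gt_traj = torch.randn(6, 2)
print(motion_regression_loss(pred_trajs, gt_traj))
```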

Vectorized Constraint Loss

This loss combines the three constraints introduced in Section 3.2: ego-agent collision (distance to other vehicles), ego-boundary overstepping (staying inside road boundaries), and ego-lane direction (driving direction).

Imitation Learning Loss

The imitation loss is computed by comparing the planned trajectory with the ground-truth ego trajectory. The overall loss for end-to-end training combines the vectorized scene learning losses (map and motion), the vectorized constraint loss, and the imitation loss.
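
Putting the pieces together, a sketch of how the overall training objective could be assembled from the terms above; the imitation term is shown here as an l1 comparison, and the loss weights are placeholders rather than the paper's values:

```python
import torch
import torch.nn.functional as F

plan, gt_plan = torch.randn(6, 2), torch.randn(6, 2)
loss_imi = F.l1_loss(plan, gt_plan)                  # imitation: planned vs. GT ego trajectory

# stand-ins for the terms computed by the sketches above
loss_map, loss_motion = torch.rand(()), torch.rand(())                         # vectorized scene learning
loss_col, loss_bd, loss_dir = torch.rand(()), torch.rand(()), torch.rand(())   # vectorized constraints

weights = dict(map=1.0, motion=1.0, col=1.0, bd=1.0, dir=1.0, imi=1.0)          # placeholder weights
total_loss = (weights["map"] * loss_map + weights["motion"] * loss_motion
              + weights["col"] * loss_col + weights["bd"] * loss_bd
              + weights["dir"] * loss_dir + weights["imi"] * loss_imi)
print(total_loss)
```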

4. Experiments

nuScenes contains 1000 scenes, each about 20 s long, with 1.4M annotated 3D bounding boxes over 23 object classes; the model takes the past 2 s as input and predicts/plans the future 3 s.

  • Open-loop planning performance
  • Ablation studies
  • Closed-loop evaluation and runtime
  • Qualitative results of VAD

Summary

  • An interesting end-to-end framework that is simple and direct; it claims better results than UniAD. For implementation details, refer to the code.