- This paper, published by Horizon Robotics, abandons rasterized scene representation in favor of a vectorized one, proposes an end-to-end driving model, and achieves state-of-the-art results on nuScenes
- code:https://github.com/hustvl/VAD
1. Introduction
- Some end-to-end models learn directly from raw sensor data and output planning results without any intermediate scene representation, which sacrifices interpretability;
- Rasterized representations are simple, but they lose instance-level information and are computationally expensive
- VAD achieves state-of-the-art planning performance, reduces the collision rate, and runs faster;
2. Related Work
Perception
- BEV-representation models:
- LSS: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, 2020.
- BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022
- MapTR: Structured modeling and learning for online vectorized HD map construction. arXiv preprint arXiv:2208.14437, 2022
Motion Prediction
- ViP3D: End-to-end visual trajectory prediction via 3D agent queries. arXiv preprint arXiv:2208.01582, 2022
- FIERY: Future instance prediction in bird's-eye view from surround monocular cameras. In ICCV, 2021
- Perceive, interact, predict: Learning dynamic and static clues for end-to-end motion prediction. arXiv preprint arXiv:2212.02181, 2022.
- Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022.
Planning
- Methods that skip perception and motion prediction and directly predict the planning trajectory or control signals are direct and simple, but they lack interpretability and are hard to optimize:
- Exploring the limitations of behavior cloning for autonomous driving. In ICCV, 2019
- Multimodal fusion transformer for end-to-end autonomous driving. In CVPR, 2021
- Reinforcement-learning methods:
- Gri: General reinforced imitation and its application to vision-based autonomous driving. arXiv preprint arXiv:2111.08575, 2021.
- Learning to drive from a world on rails. In ICCV, 2021
- End-to-end model-free reinforcement learning for urban driving using implicit affordances. In CVPR, 2020.
- Dense cost map methods:
- Mp3: A unified model to map, perceive, predict and plan. In CVPR, 2021.
- Lookout: Diverse multi-future prediction and planning for self-driving. In ICCV, 2021
- PlanT: Explainable planning transformers via object-level representations. arXiv preprint arXiv:2210.14222, 2022.
3. Method
- VAD obtains vectorized representations of the scene elements by querying the BEV features
3.1. Vectorized Scene Learning
Vectorized Map
- VAD uses map queries to retrieve map information from the BEV features and predicts map vectors $\hat{V}^m \in \mathbb{R}^{N_m \times N_p \times 2}$, where $N_m$ denotes the number of predicted map vectors and $N_p$ denotes the number of points in each map vector
- Only three kinds of map elements are annotated:
- lane divider
- road boundary
- pedestrian crossing
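As a rough, hypothetical sketch (not the actual VAD implementation), a map head of this kind could decode each map query into $N_p$ 2-D BEV points plus a score over the three map classes; all names, shapes, and dimensions below are assumptions:

```python
import torch
import torch.nn as nn

class MapHeadSketch(nn.Module):
    """Hypothetical sketch: decode map queries into point sets + class scores."""
    def __init__(self, embed_dim=256, num_points=20, num_classes=3):
        super().__init__()
        # Regression branch: each query -> num_points 2-D BEV points
        self.point_branch = nn.Linear(embed_dim, num_points * 2)
        # Classification branch: lane divider / road boundary / pedestrian crossing
        self.cls_branch = nn.Linear(embed_dim, num_classes)
        self.num_points = num_points

    def forward(self, map_queries):
        # map_queries: (batch, N_m, embed_dim), e.g. after cross-attending to BEV features
        b, n_m, _ = map_queries.shape
        points = self.point_branch(map_queries).view(b, n_m, self.num_points, 2)
        scores = self.cls_branch(map_queries)  # (batch, N_m, 3)
        return points, scores

# Toy usage with random queries
head = MapHeadSketch()
pts, cls = head(torch.randn(1, 100, 256))
print(pts.shape, cls.shape)  # torch.Size([1, 100, 20, 2]) torch.Size([1, 100, 3])
```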
Vectorized Agent Motion
- A group of agent queries extracts agent information from the BEV features, and each agent's future motion is predicted as multi-modal motion vectors
3.2. Planning via Interaction
Ego-Agent Interaction
- A randomly initialized ego query interacts with the agent queries in a transformer decoder
- The ego position and the agent positions are encoded by a single-layer MLP and used as the query position embedding and the key position embedding, respectively
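A minimal sketch of this interaction, with made-up shapes and a plain multi-head cross-attention standing in for the paper's transformer decoder; as in the notes above, a single-layer MLP encodes the BEV positions into position embeddings:

```python
import torch
import torch.nn as nn

embed_dim, n_agents = 256, 50
pos_mlp = nn.Linear(2, embed_dim)            # single-layer MLP encoding 2-D BEV positions
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

ego_query = torch.randn(1, 1, embed_dim)       # randomly initialized ego query
agent_queries = torch.randn(1, n_agents, embed_dim)
ego_pos = torch.zeros(1, 1, 2)                 # ego at the BEV origin
agent_pos = torch.randn(1, n_agents, 2)        # predicted agent BEV positions

# Query / key position embeddings derived from the positions
q = ego_query + pos_mlp(ego_pos)
k = agent_queries + pos_mlp(agent_pos)

# The ego query attends to the agent queries (values are the raw agent queries)
updated_ego, _ = cross_attn(q, k, agent_queries)
print(updated_ego.shape)  # torch.Size([1, 1, 256])
```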
Ego-Map Interaction
- The updated ego query then interacts with the map queries, analogously to the ego-agent interaction
Planning Head
- Three driving commands are used:
- turn left
- turn right
- go straight
- The planning head outputs the planning trajectory (a sequence of future ego waypoints)
- The current status of the ego vehicle can optionally be used as ego features
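A hypothetical sketch of such a planning head: the ego feature is combined with an embedding of the driving command and (optionally) the ego status, then decoded by an MLP into future waypoints. The horizon, status dimension, and all names are assumptions:

```python
import torch
import torch.nn as nn

class PlanHeadSketch(nn.Module):
    """Hypothetical sketch: ego feature + driving command + ego status -> future waypoints."""
    def __init__(self, embed_dim=256, num_cmd=3, status_dim=2, horizon=6):
        super().__init__()
        self.cmd_embed = nn.Embedding(num_cmd, embed_dim)     # turn left / turn right / go straight
        self.status_proj = nn.Linear(status_dim, embed_dim)   # optional ego status (e.g. velocity)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, horizon * 2),                # one (x, y) waypoint per future step
        )
        self.horizon = horizon

    def forward(self, ego_feat, command, ego_status=None):
        x = ego_feat + self.cmd_embed(command)
        if ego_status is not None:                            # ego status is optional
            x = x + self.status_proj(ego_status)
        return self.mlp(x).view(-1, self.horizon, 2)

head = PlanHeadSketch()
traj = head(torch.randn(1, 256), torch.tensor([2]), torch.randn(1, 2))
print(traj.shape)  # torch.Size([1, 6, 2])
```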
Ego-Agent Collision Constraint
- Low-confidence agent predictions are first filtered out by a confidence threshold
- Among the multi-modal predicted trajectories, the one with the highest probability is used as an agent's final output
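A hedged sketch of how such a collision penalty could be computed: low-confidence agents are dropped, the highest-probability modality is kept per agent, and planned waypoints closer than a safety distance are penalized. The single Euclidean distance and the threshold values are simplifying assumptions, not the paper's exact formulation:

```python
import torch

def collision_constraint_sketch(ego_traj, agent_trajs, agent_scores, mode_probs,
                                conf_thresh=0.5, safe_dist=1.0):
    """Hypothetical sketch of an ego-agent collision penalty.
    ego_traj:     (T, 2) planned ego waypoints
    agent_trajs:  (N, K, T, 2) multi-modal agent trajectories
    agent_scores: (N,) agent confidence; mode_probs: (N, K) modality probabilities
    """
    keep = agent_scores > conf_thresh                      # filter low-confidence agents
    if keep.sum() == 0:
        return ego_traj.new_zeros(())
    trajs, probs = agent_trajs[keep], mode_probs[keep]
    best = probs.argmax(dim=1)                             # highest-probability modality per agent
    best_trajs = trajs[torch.arange(trajs.size(0)), best]  # (N_keep, T, 2)
    dist = (best_trajs - ego_traj.unsqueeze(0)).norm(dim=-1)   # (N_keep, T)
    return torch.clamp(safe_dist - dist, min=0).mean()     # penalize being closer than safe_dist

loss = collision_constraint_sketch(torch.randn(6, 2), torch.randn(5, 6, 6, 2),
                                   torch.rand(5), torch.softmax(torch.randn(5, 6), dim=1))
print(loss)
```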
Ego-Boundary Overstepping Constraint
- This loss keeps the ego vehicle driving within the drivable area
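A similar hedged sketch for the boundary term: the distance from each planned waypoint to the nearest point on the predicted road-boundary vectors is computed, and waypoints closer than a threshold are penalized (threshold and shapes are assumptions):

```python
import torch

def boundary_overstep_sketch(ego_traj, boundary_pts, bd_thresh=1.0):
    """Hypothetical sketch: penalize waypoints that come too close to road boundaries.
    ego_traj:     (T, 2) planned ego waypoints
    boundary_pts: (M, 2) points sampled from the predicted road-boundary vectors
    """
    # Pairwise distances between waypoints and boundary points: (T, M)
    dist = torch.cdist(ego_traj, boundary_pts)
    nearest = dist.min(dim=1).values          # distance to the closest boundary point per step
    return torch.clamp(bd_thresh - nearest, min=0).mean()

print(boundary_overstep_sketch(torch.randn(6, 2), torch.randn(200, 2)))
```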
Ego-Lane Directional Constraint
- First, low-confidence map predictions are filtered out by a confidence threshold
- Then the closest lane divider vector is found
- Finally, the loss for this constraint is the angular difference between the lane divider vector and the ego vector, averaged over the future timestamps, where the ego vector is the planned ego motion vector at each timestamp and the angular error is measured between the lane vector and the ego vector
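A hedged sketch of this directional term, assuming the low-confidence filtering and the selection of the closest lane divider have already happened; the nearest-segment matching and all shapes are illustrative assumptions:

```python
import torch

def directional_constraint_sketch(ego_traj, divider_pts):
    """Hypothetical sketch: angular error between ego motion and the closest lane divider.
    ego_traj:    (T, 2) planned ego waypoints
    divider_pts: (P, 2) ordered points of the closest lane divider vector
    """
    ego_vec = ego_traj[1:] - ego_traj[:-1]          # planned ego motion vectors, (T-1, 2)
    lane_vec = divider_pts[1:] - divider_pts[:-1]   # lane divider segment directions, (P-1, 2)
    # For each ego step, match the nearest lane segment (by segment midpoint distance)
    mid = 0.5 * (divider_pts[1:] + divider_pts[:-1])
    idx = torch.cdist(ego_traj[:-1], mid).argmin(dim=1)
    matched = lane_vec[idx]
    cos = torch.nn.functional.cosine_similarity(ego_vec, matched, dim=-1).clamp(-1 + 1e-6, 1 - 1e-6)
    angle = torch.acos(cos)                         # angular error per future step
    return angle.mean()                             # averaged over time

print(directional_constraint_sketch(torch.cumsum(torch.randn(6, 2), 0),
                                    torch.cumsum(torch.randn(20, 2), 0)))
```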
3.4. End-to-End Learning
Vectorized Scene Learning Loss
- The map point regression loss is the Manhattan distance between the predicted map points and the ground-truth map points
- Focal loss is used for map classification; the total map loss combines the regression and classification terms
- For motion prediction, an $l_1$ loss is used for agent regression and focal loss for agent classification; among the predicted trajectories, the modality with the minimum final displacement error (minFDE) is selected as the representative prediction of the scene representation, an $l_1$ loss is computed between this trajectory and the ground-truth trajectory, and focal loss is used for multi-modal motion classification; these terms together form the total motion prediction loss
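A small illustrative sketch of how these losses could be combined in PyTorch, using `F.l1_loss` for the Manhattan/$l_1$ terms and `torchvision.ops.sigmoid_focal_loss` for classification; shapes, weights, and the winner-takes-all modality selection are assumptions rather than the paper's exact recipe:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

# Toy predictions / ground truth; shapes and weights are assumptions, not the paper's values
pred_pts, gt_pts = torch.randn(100, 20, 2), torch.randn(100, 20, 2)
pred_logits = torch.randn(100, 3)                       # 3 map classes
gt_onehot = F.one_hot(torch.randint(0, 3, (100,)), 3).float()

# Manhattan (l1) distance for map point regression
map_reg = F.l1_loss(pred_pts, gt_pts)
# Focal loss for map classification
map_cls = sigmoid_focal_loss(pred_logits, gt_onehot, reduction="mean")
map_loss = map_reg + map_cls                            # total map loss (loss weights omitted)

# Motion side: l1 between the winner modality (minimum final displacement error) and GT
pred_modes, gt_traj = torch.randn(4, 6, 6, 2), torch.randn(4, 6, 2)   # (agents, modes, T, 2)
fde = (pred_modes[:, :, -1] - gt_traj[:, None, -1]).norm(dim=-1)      # final displacement per mode
winner = fde.argmin(dim=1)
best = pred_modes[torch.arange(4), winner]
motion_reg = F.l1_loss(best, gt_traj)
print(map_loss.item(), motion_reg.item())
```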
Vectorized Constraint Loss
- This loss consists of the three constraints described in Sec. 3.2: distance to other agents (collision), lane boundary (overstepping), and driving direction
Imitation Learning Loss
- The imitation loss is computed by comparing the planned trajectory with the ground-truth ego trajectory:
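The formula that belongs here is presumably an $l_1$ distance between the planned and ground-truth ego trajectories, roughly of the form (notation assumed):

$$
\mathcal{L}_{\text{imi}} = \frac{1}{T_f}\sum_{t=1}^{T_f}\bigl\| \hat{\tau}_t - \tau_t^{gt} \bigr\|_1
$$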
4. Experiments
- nuScenes contains 1000 scenes of 20 s each, with about 1.4 M 3D bounding boxes over 23 object classes; 2 s of past data are used to predict 3 s into the future
Summary
- An interesting end-to-end framework, simple and direct; it claims better performance than UniAD, and the implementation details require reading the code