Autonomous Driving | Planning-oriented Autonomous Driving

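Hierarchical autonomous-driving stacks that chain perception, prediction, and planning suffer from accumulated errors and inefficient coordination between tasks. This paper treats planning as the ultimate task and makes every module serve it, building the Unified Autonomous Driving (UniAD) model. UniAD surpasses previous work by a large margin on all reported metrics and is the state-of-the-art (SOTA) end-to-end model.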

Code: https://github.com/OpenDriveLab/UniAD

1. Introduction

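Autonomous driving combines several tasks:
- Perception: detection, tracking, mapping
- Prediction: motion and occupancy forecasting
Most frameworks handle each task with a standalone module, while multi-task learning frameworks share one backbone and learn each task with a separate head. The framework in this paper designs the other modules around planning so that they cooperate more efficiently.
Drawbacks of hierarchical (modular) pipelines:
- Information is lost at each stage
- Errors accumulate from stage to stage
- Isolated optimization objectives lead to feature misalignment
- Cross-team collaboration is difficult
Multi-task learning (MTL) usually extracts shared features first and then branches out to the individual tasks, but this scheme sometimes causes "negative transfer":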
- Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796, 2020.
- BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In ICRA, 2023.
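The paper also compares which tasks several multi-task learning models cover.
The task modules that a safe and robust autonomous driving system needs:
- 3D object detection, object tracking, online mapping, motion forecasting, occupancy prediction, and trajectory planning
This paper connects all modules through queries rather than wiring them together purely as an engineering exercise; this mitigates the errors accumulated upstream and makes it easier to model and encode interactions.
Contributions:
- A planning-oriented framework rather than plain multi-task learning;
- UniAD, which connects all nodes with queries;
- Extensive ablation studies showing that the method outperforms previous approaches and reaches SOTA.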

2. Methodology

The model uses an off-the-shelf BEV encoder following BEVFormer, but UniAD is not tied to this encoder: it can be swapped for a comparable one, and long-term temporal fusion or multi-modality fusion can also be plugged in:
- long-term temporal fusion:
  - Exploring recurrent long-term temporal fusion for multi-view 3d perception. arXiv preprint arXiv:2303.05970, 2023.
  - Time will tell: New outlooks and a baseline for temporal multiview 3d object detection. arXiv preprint arXiv:2210.02443, 2022.
- multi-modality fusion:
  - BEVFusion: A simple and robust lidar-camera fusion framework. In NeurIPS, 2022.
  - BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In ICRA, 2023.
- BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.

2.1. Perception: Tracking and Mapping

TrackFormer

This module performs object detection and multi-object tracking jointly, with no non-differentiable post-processing.
TrackFormer contains $N$ layers, and the final output state $Q_A$ provides knowledge of $N_a$ valid agents for downstream prediction tasks.
We introduce one particular ego-vehicle query in the query set to explicitly model the self-driving vehicle itself, which is further used in planning.
References:
- End-to-end multiple-object tracking with transformer. In ECCV, 2021
- MUTR3D: A Multi-camera Tracking Framework via 3D-to-2D Queries. In CVPR Workshop, 2022.
- End-to-end object detection with transformers. In ECCV, 2020

MapFormer

Road elements are sparsely represented as map queries, which encode location and structure knowledge and help the downstream motion forecasting.
MapFormer also has $N$ stacked layers whose output results of each layer are all supervised, while only the updated queries $Q_M$ in the last layer are forwarded to MotionFormer for agent-map interaction.

2.2. Prediction: Motion Forecasting

MotionFormer

TrackFormer and MapFormer provide two kinds of highly abstract queries:
- dynamic agent queries
- static map queries
MotionFormer predicts multimodal future trajectories for all agents (i.e., top-k possible trajectories, in a scene-centric manner):
- Stopnet: Scalable trajectory and occupancy prediction for urban autonomous driving. arXiv preprint arXiv:2206.00991, 2022
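The output future trajectories are $\{\hat{x}_{i,k} \in \mathbb{R}^{T \times 2} \mid i = 1, \dots, N_a;\ k = 1, \dots, K\}$, where $i$ indexes the agent, $k$ indexes the trajectory modality, and $T$ is the length of the prediction horizon.
It is composed of $N$ layers, and each layer captures three types of interactions: agent-agent, agent-map and agent-goal point.
For each motion query $Q_{i,k}$ (subscripts are dropped below for brevity), the interaction with other agents or map elements is $Q_{a/m} = \text{MHCA}(\text{MHSA}(Q), Q_A / Q_M)$, where MHCA and MHSA denote multi-head cross-attention and multi-head self-attention.
The agent-goal point attention is defined with deformable attention: $Q_g = \text{DeformAttn}(Q, \hat{x}^{l-1}_T, B)$, where $\hat{x}^{l-1}_T$ is the endpoint of the trajectory predicted by the previous layer and $\text{DeformAttn}(q, r, x)$ is a deformable attention module that takes a query $q$, a reference point $r$, and a spatial feature $x$ (here the BEV feature $B$).
The three interactions are computed in parallel, producing $Q_a$, $Q_m$, and $Q_g$, which are concatenated and fed to an MLP to obtain the query context $Q_{ctx}$; $Q_{ctx}$ is then passed to the next layer for refinement or, in the last layer, decoded into the prediction results.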

Motion queries

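The queries fed into each MotionFormer layer are called motion queries. Each consists of two parts:
- query context $Q_{ctx}$ (produced by the previous layer)
- query position $Q_{pos}$ (integrates four positions)
  - 1. scene-level anchor $I^s$
  - 2. agent-level anchor $I^a$ (obtained by k-means clustering of ground-truth trajectory endpoints)
  - 3. current position of the agent $\hat{x}_0$
  - 4. predicted goal point $\hat{x}^{l-1}_T$
  $Q_{pos} = \text{MLP}(\text{PE}(I^s)) + \text{MLP}(\text{PE}(I^a)) + \text{MLP}(\text{PE}(\hat{x}_0)) + \text{MLP}(\text{PE}(\hat{x}^{l-1}_T))$
  - where $\text{PE}(\cdot)$ denotes the sinusoidal position encoding and $\hat{x}^0_T$ is set to $I^s$ at the first layer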
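
The agent-level anchors are described as k-means clusters of ground-truth trajectory endpoints. The clustering step can be sketched in plain Python as follows. The function name `kmeans_anchors`, the toy endpoints, and the deterministic initialisation from the first `k` points are illustrative assumptions, not UniAD's actual code.

```python
import math

def kmeans_anchors(endpoints, k, iters=20):
    """Cluster 2D trajectory endpoints into k anchors with Lloyd's algorithm.

    Assumption: centroids are initialised from the first k endpoints so the
    sketch is deterministic; real pipelines usually use random restarts.
    """
    centroids = [tuple(p) for p in endpoints[:k]]
    for _ in range(iters):
        # Assignment step: attach every endpoint to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in endpoints:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                xs, ys = zip(*members)
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(centroids[j])  # keep empty clusters in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Toy ground-truth endpoints (metres, agent frame): "go straight" vs "turn left".
endpoints = [(20.0, 0.0), (-2.0, 15.0), (22.0, 1.0), (-1.0, 17.0), (21.0, -1.0)]
anchors = kmeans_anchors(endpoints, k=2)
```

Each resulting anchor would then go through the position encoding and an MLP before being summed into the query position.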

Non-linear Optimization

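To handle uncertainty in the perceived trajectories (for example, abnormal curvature or heading), the paper smooths the ground-truth trajectories with a non-linear optimization: $\tilde{x}^* = \arg\min_{x} c(x, \tilde{x})$, where $\tilde{x}$ and $\tilde{x}^*$ denote the ground-truth and smoothed trajectories, and $x$ is generated by the multiple-shooting algorithm (A multiple shooting algorithm for direct solution of optimal control problems. IFAC Proceedings Volumes, 1984.).
The cost function is defined as $c(x, \tilde{x}) = \lambda_{xy} \lVert x - \tilde{x} \rVert_2 + \lambda_{goal} \lVert x_T - \tilde{x}_T \rVert_2 + \sum_{\phi \in \Phi} \phi(x)$, where $\lambda_{xy}$ and $\lambda_{goal}$ are hyperparameters and the kinematic function set $\Phi$ has five terms: jerk, curvature, curvature rate, acceleration, and lateral acceleration.
The cost function regularizes the target trajectory to obey kinematic constraints. This target-trajectory optimization is performed only during training and plays no part in inference.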
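
To make the trade-off in this smoothing cost concrete, the sketch below evaluates a cost with a position term, a goal term, and kinematic penalties for two candidate trajectories. It only evaluates the cost and does not perform the multiple-shooting optimization. The weights, the time step `dt`, the squared finite-difference penalties, and the restriction to acceleration and jerk (two of the five kinematic terms) are all assumptions made for illustration.

```python
import math

def finite_diff(values, dt):
    """First-order finite difference of a sequence of 2D vectors."""
    return [((b[0] - a[0]) / dt, (b[1] - a[1]) / dt) for a, b in zip(values, values[1:])]

def smoothing_cost(x, x_gt, dt=0.5, lam_xy=1.0, lam_goal=1.0, lam_kin=0.1):
    """Position term + goal term + kinematic penalties.

    Assumptions: weights and dt are placeholders, and only acceleration and
    jerk penalties are included out of the five kinematic terms.
    """
    pos_term = sum(math.dist(p, q) for p, q in zip(x, x_gt))
    goal_term = math.dist(x[-1], x_gt[-1])
    vel = finite_diff(x, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    kin_term = sum(ax**2 + ay**2 for ax, ay in acc) + sum(jx**2 + jy**2 for jx, jy in jerk)
    return lam_xy * pos_term + lam_goal * goal_term + lam_kin * kin_term

# A noisy ground-truth track and a straight constant-velocity candidate.
x_gt = [(0.0, 0.0), (1.0, 0.3), (2.0, -0.3), (3.0, 0.2), (4.0, 0.0)]
x_smooth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]

cost_gt = smoothing_cost(x_gt, x_gt)          # zero tracking error, large kinematic penalty
cost_smooth = smoothing_cost(x_smooth, x_gt)  # small tracking error, zero kinematic penalty
```

With these placeholder weights the straight candidate is cheaper than the jittery ground truth itself, which is the behaviour the regularization is meant to produce.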

2.3. Prediction: Occupancy Prediction

Occupancy Prediction

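An occupancy grid map is a discretized BEV representation in which each cell holds a belief indicating whether it is occupied; the occupancy prediction task is to discover how the grid map changes in the future.
Previous models take BEV features and roll them forward in time with an RNN structure to predict the future (see the references below). However, they rely on highly hand-crafted clustering post-processing to generate per-agent occupancy maps, because they compress the BEV features as a whole into the RNN hidden state and are therefore mostly agent-agnostic.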
- FIERY: Future instance prediction in bird’s-eye view from surround monocular cameras. In ICCV, 2021.
- ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In ECCV, 2022.
- BEVerse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022.
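This paper uses OccFormer to unify scene-level and agent-level semantics in two ways:
- 1. while unrolling to the future horizon, the dense scene feature acquires agent-level features through a carefully designed attention module;
- 2. instance-wise occupancy is obtained easily by a matrix multiplication between agent-level and scene-level features, without heavy post-processing.
OccFormer is composed of $T_o$ sequential blocks, where $T_o$ indicates the prediction horizon. Note that $T_o$ is typically smaller than $T$ in the motion task, due to the high computation cost of densely represented occupancy.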
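Each block takes two inputs:
- the agent feature $G^t$
- the state (dense) feature $F^{t-1}$ from the previous block
To give $G^t$ dynamic and spatial priors, the motion queries from MotionFormer are max-pooled over the modality dimension, yielding $Q_X \in \mathbb{R}^{N_a \times D}$, where $D$ is the feature dimension.
$Q_X$ is then fused with the upstream track query $Q_A$ and the current position embedding $P_A$ through a temporal-specific MLP:

$G^t = \text{MLP}_t([Q_A, P_A, Q_X]), \quad t = 1, \dots, T_o$

where $[\cdot]$ denotes concatenation. For the scene-level input, the BEV feature $B$ is downscaled to 1/4 resolution for training efficiency and serves as the first block input $F^0$.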
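
The construction of the agent feature input can be sketched as follows: the motion queries are max-pooled over the modality axis (the result is called `q_x` here), then concatenated per agent with the track query and the current position embedding. The temporal-specific MLP itself is left out, and all names and toy values are hypothetical.

```python
def max_pool_modalities(motion_queries):
    """Max-pool motion queries over the modality axis: (N_a, K, D) -> (N_a, D)."""
    return [[max(mode[d] for mode in agent) for d in range(len(agent[0]))]
            for agent in motion_queries]

def build_agent_feature_input(track_q, pos_emb, motion_queries):
    """Concatenate track query, position embedding and pooled motion query per agent.

    The result is what a per-timestep MLP would consume (MLP omitted here).
    """
    q_x = max_pool_modalities(motion_queries)
    return [qa + pa + qx for qa, pa, qx in zip(track_q, pos_emb, q_x)]

# Toy sizes: N_a = 2 agents, K = 3 modes, D = 2 channels.
motion_queries = [
    [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]],
    [[-1.0, 0.0], [-0.5, -2.0], [-0.7, 1.0]],
]
track_q = [[1.0, 2.0], [3.0, 4.0]]   # track queries
pos_emb = [[0.0, 0.5], [0.5, 0.0]]   # current position embeddings

q_x = max_pool_modalities(motion_queries)
mlp_input = build_agent_feature_input(track_q, pos_emb, motion_queries)
```

Max-pooling keeps one vector per agent regardless of how many trajectory modes were predicted, so the input width of the MLP stays fixed at three times the channel count.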

To further conserve training memory, each block follows a downsample-upsample manner with an attention module in between to conduct pixel-agent interaction at 1/8 downscaled feature, denoted as $F^t_{ds}$.

Pixel-agent interaction

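This module unifies scene-level and agent-level understanding when predicting future occupancy: the dense feature $F^t_{ds}$ acts as the queries, and the instance-level features act as keys and values that update the dense feature.
In detail, a self-attention layer models the responses between distant grid cells, and a cross-attention layer then models the interaction between the agent feature $G^t$ and each grid cell.
To align pixels with agents, an attention mask constrains the cross-attention so that each pixel only looks at the agent occupying it at timestep $t$:
- Masked-attention mask transformer for universal image segmentation. In CVPR, 2022
The dense feature is updated as $D^t_{ds} = \text{MHCA}(\text{MHSA}(F^t_{ds}), G^t, \text{attn\_mask} = O^t_m)$. The attention mask $O^t_m$ is generated by multiplying an additional agent-level feature, the mask feature $M^t = \text{MLP}(G^t)$, with the dense feature $F^t_{ds}$. $D^t_{ds}$ is then upsampled to 1/4 the size of $B$ and added to the block input $F^{t-1}$ as a residual connection; the resulting feature $F^t$ is passed to the next block.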
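
The effect of the attention mask can be illustrated with a single-head cross-attention written in plain Python, where each pixel attends only to the agents its mask row allows. This is a sketch under simplifying assumptions: no learned projections, no preceding self-attention, one head, and pixels with no visible agent receive a zero update. All names and values are hypothetical.

```python
import math

def masked_cross_attention(pixel_q, agent_k, agent_v, mask):
    """Single-head cross-attention where mask[p][a] == True lets pixel p see agent a.

    Assumption: pixels that no agent occupies receive a zero update.
    """
    dim = len(pixel_q[0])
    out = []
    for p, q in enumerate(pixel_q):
        visible = [a for a in range(len(agent_k)) if mask[p][a]]
        if not visible:
            out.append([0.0] * len(agent_v[0]))
            continue
        # Scaled dot-product scores over the visible agents only.
        scores = [sum(qi * ki for qi, ki in zip(q, agent_k[a])) / math.sqrt(dim)
                  for a in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([
            sum(w / total * agent_v[a][c] for w, a in zip(weights, visible))
            for c in range(len(agent_v[0]))
        ])
    return out

pixel_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 BEV pixels, 2 channels
agent_k = [[1.0, 0.0], [0.0, 1.0]]               # 2 agents
agent_v = [[10.0, 0.0], [0.0, 20.0]]
mask = [[True, False], [False, True], [False, False]]
updated = masked_cross_attention(pixel_q, agent_k, agent_v, mask)
```

Because each of the first two pixels sees exactly one agent, its update equals that agent's value vector, while the third pixel, which no agent occupies, is left at zero.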

Instance-level occupancy

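Query-based segmentation works:
- Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021
- Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In CVPR, 2023.
To recover a BEV feature at the original size $H \times W$, the scene-level feature $F^t$ is upsampled to $F^t_{dec} \in \mathbb{R}^{C \times H \times W}$ by a convolutional decoder.
For the agent-level feature, another MLP further turns the coarse mask feature $M^t$ into the occupancy feature $U^t \in \mathbb{R}^{N_a \times C}$ (generating $U^t$ from $M^t$ rather than from the original agent feature $G^t$ works better). The instance-level occupancy is then the matrix product $\hat{O}^t_A = U^t \cdot F^t_{dec}$.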
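
The matrix multiplication between agent-level and scene-level features that yields instance-level occupancy can be sketched as a channel-wise dot product per BEV cell. The sigmoid and the 0.5 threshold are assumptions added to turn the raw product into a binary map; the function name and toy tensors are hypothetical.

```python
import math

def instance_occupancy(agent_feat, scene_feat):
    """Per-agent occupancy probability maps from a channel-wise dot product.

    agent_feat: N_a x C nested list, scene_feat: C x H x W nested list.
    Returns an N_a x H x W nested list of probabilities.
    Assumption: a sigmoid maps the raw product to probabilities.
    """
    channels = len(scene_feat)
    height, width = len(scene_feat[0]), len(scene_feat[0][0])
    maps = []
    for u in agent_feat:
        logits = [[sum(u[c] * scene_feat[c][y][x] for c in range(channels))
                   for x in range(width)] for y in range(height)]
        maps.append([[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits])
    return maps

# Toy sizes: N_a = 2, C = 2, H = W = 2.
agent_feat = [[4.0, -4.0], [-4.0, 4.0]]
scene_feat = [
    [[1.0, 0.0], [0.0, 0.0]],   # channel 0 fires at the top-left cell
    [[0.0, 0.0], [0.0, 1.0]],   # channel 1 fires at the bottom-right cell
]
occ = instance_occupancy(agent_feat, scene_feat)
occupied = [[[p > 0.5 for p in row] for row in m] for m in occ]
```

Each agent gets its own map directly from the product, so no clustering step is needed to separate instances.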

2.4. Planning

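The raw navigation signal (go straight / turn left / turn right) is turned into three learnable embeddings.
A plan query is built and used to attend to the BEV feature, gathering information about the surroundings, and is then decoded into future waypoints.
To avoid collisions, the plan is optimized with Newton's method:

$\tau^* = \arg\min_{\tau} f(\tau, \hat{\tau}, \hat{O})$

where $\hat{\tau}$ is the original planning prediction and $\tau^*$ is the optimized plan, selected from the multiple-shooting trajectories $\tau$ to minimize the cost function $f(\cdot)$; $\hat{O}$ is a classical binary occupancy map merged from the instance-wise occupancy predictions of OccFormer.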

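- A multiple shooting algorithm for direct solution of optimal control problems. IFAC Proceedings Volumes, 1984
The cost function is defined as $f(\tau, \hat{\tau}, \hat{O}) = \lambda_{coord} \lVert \tau - \hat{\tau} \rVert_2 + \lambda_{obs} \sum_t \mathcal{D}(\tau_t, \hat{O}^t)$ with $\mathcal{D}(\tau_t, \hat{O}^t) = \sum_{(x, y) \in S} \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{\lVert \tau_t - (x, y) \rVert_2^2}{2\sigma^2}\right)$, where $\lambda_{coord}$, $\lambda_{obs}$, and $\sigma$ are hyperparameters, $\mathcal{D}$ is the collision term that pushes the agent away from occupied grid cells, and the surrounding positions are defined as $S = \{(x, y) \mid \lVert (x, y) - \tau_t \rVert_2 < d,\ \hat{O}^t_{x,y} = 1\}$.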
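
A cost of this shape can be sketched as an L2 term that keeps the plan near the original prediction plus a collision penalty over nearby occupied cells. The Gaussian form of the penalty, the weights, the radius `d`, and `sigma` are assumptions for illustration, and picking the cheapest of two hand-written candidates merely stands in for the actual optimization.

```python
import math

def collision_term(point, occupancy, d=2.0, sigma=1.0):
    """Sum of Gaussian penalties from occupied cells within distance d of point."""
    total = 0.0
    for y, row in enumerate(occupancy):
        for x, occupied in enumerate(row):
            dist = math.dist(point, (x, y))
            if occupied and dist < d:
                total += math.exp(-dist**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
    return total

def planning_cost(traj, traj_pred, occ_seq, lam_coord=1.0, lam_obs=10.0):
    """Deviation from the predicted plan plus a weighted collision penalty."""
    coord = sum(math.dist(p, q) for p, q in zip(traj, traj_pred))
    obs = sum(collision_term(p, occ) for p, occ in zip(traj, occ_seq))
    return lam_coord * coord + lam_obs * obs

# 5x5 grid with one occupied cell at (x=2, y=2), constant over three future steps.
occ = [[1 if (x, y) == (2, 2) else 0 for x in range(5)] for y in range(5)]
occ_seq = [occ, occ, occ]
traj_pred = [(0.0, 2.0), (2.0, 2.0), (4.0, 2.0)]   # drives straight through the obstacle
candidates = [
    traj_pred,
    [(0.0, 2.0), (2.0, 4.0), (4.0, 2.0)],          # swerves around it
]
best = min(candidates, key=lambda t: planning_cost(t, traj_pred, occ_seq))
```

With these placeholder weights the swerving candidate wins: it pays a deviation cost of 2.0 but avoids the collision penalty that the straight plan incurs at the occupied cell.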

2.5. Learning

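UniAD is trained in two stages, which makes training more stable:
1. Jointly train the perception modules for a few epochs (e.g., 6 epochs)
2. Then train the whole model end-to-end, with perception, prediction, and planning together, for 20 epochs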
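
The two-stage schedule can be written down as a small configuration plus a loop that activates only the losses belonging to the current stage. The stage names, task keys, equal loss weighting, and the `step_fn` callback are hypothetical; only the epoch counts and the perception-first ordering come from the notes above.

```python
STAGES = [
    {"name": "stage1_perception", "epochs": 6,
     "tasks": ["track", "map"]},
    {"name": "stage2_end_to_end", "epochs": 20,
     "tasks": ["track", "map", "motion", "occ", "plan"]},
]

def total_loss(losses, active_tasks):
    """Sum only the losses of tasks active in the current stage (equal weights assumed)."""
    return sum(losses[t] for t in active_tasks)

def run_schedule(stages, step_fn):
    """Call step_fn(stage_name, epoch, tasks) once per epoch, stage by stage."""
    log = []
    for stage in stages:
        for epoch in range(stage["epochs"]):
            log.append(step_fn(stage["name"], epoch, stage["tasks"]))
    return log

# Placeholder per-task losses; a real step would run the model on a batch.
dummy_losses = {"track": 1.0, "map": 0.5, "motion": 2.0, "occ": 1.5, "plan": 0.25}
log = run_schedule(
    STAGES,
    lambda name, epoch, tasks: (name, epoch, total_loss(dummy_losses, tasks)),
)
```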

Shared matching

DETR uses a bipartite matching algorithm:
- End-to-end object detection with transformers. In ECCV, 2020
- Panoptic segformer: Delving deeper into panoptic segmentation with transformers. In CVPR, 2022
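
As a reminder of what bipartite matching computes, the sketch below finds the one-to-one assignment of queries to ground-truth objects with the lowest total cost. It uses brute force over permutations, which is only viable for tiny examples; larger problems are usually solved with the Hungarian algorithm, for example SciPy's `scipy.optimize.linear_sum_assignment`. The cost matrix and function name are made up for illustration.

```python
from itertools import permutations

def bipartite_match(cost):
    """Return (pairs, total) minimising the summed cost of a one-to-one matching.

    cost[i][j] is the cost of assigning query i to ground-truth object j.
    Assumption: brute force over permutations, and at least as many queries
    as objects.
    """
    n_queries, n_targets = len(cost), len(cost[0])
    best_pairs, best_total = None, float("inf")
    for queries in permutations(range(n_queries), n_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if total < best_total:
            best_pairs = [(q, t) for t, q in enumerate(queries)]
            best_total = total
    return best_pairs, best_total

# 3 queries, 2 ground-truth objects; lower cost means a better match.
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.6],
]
pairs, total = bipartite_match(cost)
```

Here query 1 is matched to object 0 and query 0 to object 1, while query 2 stays unmatched and would be supervised as "no object".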

3. Experiments

Experiments are conducted on the nuScenes dataset:
- nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020
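Ablation study (the first row of the ablation table is a plain multi-task learning model).
Two motion forecasting metrics achieved:
- minADE: 0.71 m
- minFDE: 1.02 m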
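
For readers unfamiliar with these metrics, the sketch below computes minADE and minFDE for one agent with several predicted modes. It assumes each metric is minimised over the modes independently; benchmark implementations may select the mode differently, and the toy trajectories are invented.

```python
import math

def min_ade_fde(pred_modes, gt):
    """Return (minADE, minFDE) over K predicted trajectories for one agent.

    Assumption: each metric is minimised over modes independently.
    """
    ades, fdes = [], []
    for traj in pred_modes:
        errors = [math.dist(p, g) for p, g in zip(traj, gt)]
        ades.append(sum(errors) / len(errors))   # average displacement error
        fdes.append(errors[-1])                  # final displacement error
    return min(ades), min(fdes)

gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pred_modes = [
    [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)],   # drifts left
    [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)],   # close to the ground truth
]
min_ade, min_fde = min_ade_fde(pred_modes, gt)
```

Taking the minimum over modes rewards a model for covering the true future with at least one of its predicted trajectories.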

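Summary

Compared with typical end-to-end multi-task learning methods, this framework adopts a query-based design and optimizes toward the planning result. The idea is fairly novel and the metrics are strong, but making the model lightweight and easy to deploy remains a challenge.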