
Autonomous Driving | Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

Driving safely and efficiently requires anticipating the future and avoiding risks in advance; this paper proposes a method that uses a world model to generate multiview video data for training the planning model.

  • Project Page: https://drive-wm.github.io

  • Code: https://github.com/BraveGroup/Drive-WM

  • Multiview visual forecasting and planning by world model

1. Introduction

  • Out-of-distribution (OOD) cases: when the ego vehicle deviates from the lane centerline, the planner produces abnormal results because such cases were never seen during training;

    The ego vehicle's slight deviation from the centerline causes the motion planner to struggle to generate reasonable trajectories
  • To address this, the paper proposes a world model that forecasts the outcomes of planned decisions, so that potential risks can be anticipated;

  • World models:

    • Recurrent world models facilitate policy evolution. NeurIPS, 31, 2018.
    • Dream to control: Learning behaviors by latent imagination. In ICLR, 2020.
    • A path towards autonomous machine intelligence. 2022
  • Three main challenges:

    • The world model needs to work in a high-resolution pixel space; otherwise many fine-grained, non-vectorizable events cannot be represented effectively. Vector-space world models require extra vector annotations and are affected by the state-estimation noise of the perception model.
    • Generating multiview-consistent video data is difficult; prior work mainly covers single-view video and multiview image generation;
    • Adapting to various heterogeneous conditions is difficult, e.g. weather, lighting, ego actions, and vehicle layouts
  • Main contributions:

    • Drive-WM, a multiview world model that generates high-quality, consistent multiview videos of autonomous-driving scenes;
    • Experiments on the nuScenes dataset demonstrate high generation quality and strong controllability;
    • A first exploration of combining the world model with end-to-end planning;

2.1. Video Generation and Prediction

  • Video generation:

    • VAE-based:

      • Stochastic latent residual video prediction. In ICML, pages 3233–3246. PMLR, 2020
      • Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
      • High fidelity video prediction with large stochastic recurrent neural networks. NeurIPS, 32, 2019
      • Predicting video with vqvae. arXiv preprint arXiv:2103.01950, 2021
    • GAN-based

      • Generating long videos of dynamic scenes. NeurIPS, 35:31769–31781, 2022.
      • Stylevideogan: A temporal generative model using a pretrained stylegan. arXiv preprint arXiv:2107.07224, 2021
      • Drivegan: Towards a controllable high-quality neural simulation. In CVPR, pages 5820–5829, 2021.
      • Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In CVPR, pages 3626–3636, 2022.
      • Mocogan: Decomposing motion and content for video generation. In CVPR, pages 1526–1535, 2018.
    • flow-based

      • Stochastic image-to-video synthesis using cinns. In CVPR, pages 3742–3753, 2021
      • Videoflow: A conditional flow-based model for stochastic video generation. In ICLR, 2020.
    • auto-regressive models

      • Long video generation with time-agnostic vqgan and time-sensitive transformer. In ECCV, pages 102–118. Springer, 2022
      • Scaling autoregressive video models. In ICLR, 2020.
      • Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
    • diffusion-based

      • Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pages 16784–16804, 2022
      • High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
      • Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023.
      • Flexible diffusion modeling of long videos. NeurIPS, 35:27953–27965, 2022
      • Diffusion models for video prediction and infilling. TMLR, 2022

  • Video prediction:

    • DriveGAN:Drivegan: Towards a controllable high-quality neural simulation. In CVPR, pages 5820–5829, 2021.
    • GAIA-1:Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023
    • DriveDreamer:Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.

2.2. World Model for Planning

  • Dreamer learns a dynamics model from past experience and then predicts state values and actions in latent space;
    • Dream to control: Learning behaviors by latent imagination. In ICLR, 2020
  • DreamerV2 further reaches human-level performance on Atari;
    • Mastering atari with discrete world models. In ICLR, 2021.
  • DreamerV3 further tackles long-horizon survival challenges in the sparse-reward Minecraft environment;
    • Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
  • DayDreamer applies Dreamer to train physical robots online for locomotion and manipulation tasks, without changing its hyperparameters;
    • World models for physical robot learning. In CoRL, pages 2226–2240. PMLR, 2023.
  • MILE: model-based imitation learning for autonomous driving in CARLA
    • Model-based imitation learning for urban driving. NeurIPS, 35:20703–20716, 2022.

3. Multi-view Video Generation

3.1. Joint Modeling of Multiview Video

Formulation
  • The dataset consists of multiview videos $x \in \mathbb{R}^{T \times K \times 3 \times H \times W}$, i.e., images from $K$ views over $T$ time steps; the latent representation of an encoded video is $z = \mathcal{E}(x)$.

  • The diffused input is $z_\tau = \alpha_\tau z + \sigma_\tau \epsilon$, where $\alpha_\tau$ and $\sigma_\tau$ are the noise-schedule parameters at diffusion time step $\tau$.

  • The denoising model $f_{\theta,\phi,\psi}$ (parameterized by spatial parameters $\theta$, temporal parameters $\phi$ and multiview parameters $\psi$) receives the diffused $z_\tau$ and minimizes the denoising score-matching objective $\mathbb{E}_{z, c, \tau, \epsilon}\big[\, \| y - f_{\theta,\phi,\psi}(z_\tau; c, \tau) \|^2 \,\big]$, where $c$ is the condition, $y$ is the target (the injected noise $\epsilon$), and $\tau$ is drawn uniformly over the diffusion time steps;
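The objective above is the standard latent-diffusion denoising loss applied to multiview video latents. As a minimal sketch (not the paper's code; `denoiser`, `alphas`, and `sigmas` are assumed placeholders), one training step could look like:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, z, cond, alphas, sigmas):
    """One denoising score-matching step on video latents.

    z:      encoded video latents, shape (B, T, K, C, H, W) with T frames and K views
    cond:   unified condition embeddings c
    alphas, sigmas: 1-D noise-schedule tensors indexed by diffusion step
    """
    B = z.shape[0]
    tau = torch.randint(0, len(alphas), (B,), device=z.device)  # uniform diffusion time
    eps = torch.randn_like(z)                                   # target y = eps
    a = alphas[tau].view(B, 1, 1, 1, 1, 1)
    s = sigmas[tau].view(B, 1, 1, 1, 1, 1)
    z_tau = a * z + s * eps                                     # diffused input z_tau
    pred = denoiser(z_tau, cond, tau)                           # f_{theta,phi,psi}(z_tau; c, tau)
    return F.mse_loss(pred, eps)                                # || y - f(...) ||^2
```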

Temporal encoding layers
Multiview encoding layers
Multiview temporal tuning
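The note leaves these three subsections empty. As a rough illustration of the factorized attention pattern such designs typically use (my reading, not the paper's exact architecture), spatial, temporal, and multiview attention can each mix tokens along a single axis of a (batch, frame, view, token) tensor:

```python
import torch
import torch.nn as nn

class FactorizedSTVBlock(nn.Module):
    """Spatial -> temporal -> multiview self-attention over a (B, T, K, N, D)
    tensor, where T is frames, K is camera views, N = H*W spatial tokens and
    D is the channel dimension. Each attention mixes tokens along one axis
    only, folding the other axes into the batch dimension."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.multiview = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, T, K, N, D = x.shape
        # spatial attention: tokens within each (frame, view) image
        s = x.reshape(B * T * K, N, D)
        x = x + self.spatial(s, s, s)[0].reshape(B, T, K, N, D)
        # temporal attention: same view and spatial location across frames
        t = x.permute(0, 2, 3, 1, 4).reshape(B * K * N, T, D)
        x = x + self.temporal(t, t, t)[0].reshape(B, K, N, T, D).permute(0, 3, 1, 2, 4)
        # multiview attention: same frame and spatial location across views
        v = x.permute(0, 1, 3, 2, 4).reshape(B * T * N, K, D)
        x = x + self.multiview(v, v, v)[0].reshape(B, T, N, K, D).permute(0, 1, 3, 2, 4)
        return x
```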

3.2. Factorization of Joint Multiview Modeling

  • To ensure multiview consistency, a distribution factorization is introduced
Formulation
  • Let $x^i$ denote the sample of the $i$-th view. Section 3.1 modeled the joint multiview distribution $p(x^1, \dots, x^K)$, which can be factorized as $p(x^1, \dots, x^K) = p(x^1)\prod_{i=2}^{K} p(x^i \mid x^{1:i-1})$, so a new view image is generated conditioned on the already generated view images. Such conditional distributions yield better multiview consistency, but this autoregressive generation is inefficient, making the full factorization impractical

  • To handle this, the views are divided into two groups: (1) reference views and (2) stitched views; for example, in nuScenes the reference views can be {F, BL, BR} (F: front, B: back, L: left, R: right) and the stitched views are typically {FL, B, FR}

  • (We use the term “stitched” because a stitched view appears to be “stitched” from its two neighboring reference views)

  • View factorization scheme:

    Illustration of factorized multi-view generation (figure). Taking temporal coherence into account, the factorization is further rewritten so that generation is also conditioned on previously generated frames, i.e. $p(x_s \mid x_r, \tilde{x})$ for the stitched views, where $\tilde{x}$ denotes the content generated in previous frames;
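A minimal sketch of how this factorized sampling could be orchestrated (the `generate` sampler and its arguments are hypothetical; the paper's actual procedure may differ):

```python
# Factorized multiview sampling: generate the reference views first, then each
# stitched view conditioned on its two neighboring reference views and on the
# previously generated frames. View names follow the nuScenes camera layout.
REFERENCE_VIEWS = ["F", "BL", "BR"]
STITCHED_VIEWS = {"FL": ("F", "BL"), "B": ("BL", "BR"), "FR": ("F", "BR")}

def rollout_multiview_video(generate, conditions, num_frames):
    frames = []                                   # previously generated content x~
    for t in range(num_frames):
        views = {}
        # 1) jointly sample the reference views, conditioned on past frames
        views.update(generate(view_names=REFERENCE_VIEWS,
                              cond=conditions[t], prev_frames=frames))
        # 2) sample each stitched view conditioned on its neighbors and past frames
        for name, (left, right) in STITCHED_VIEWS.items():
            views[name] = generate(view_names=[name],
                                   cond=conditions[t], prev_frames=frames,
                                   neighbors=(views[left], views[right]))[name]
        frames.append(views)
    return frames
```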

3.3. Unified Conditional Generation

  • Conditioning information used:

    • initial context frames,
    • text descriptions,
    • ego actions,
    • 3D boxes,
    • BEV maps,
    • reference views
  • Image condition:

    • A given image condition is encoded and flattened into a $d$-dimensional embedding, using ConvNeXt as the encoder; multiple condition images can be concatenated to form the initial image condition
  • Layout condition:

    • The layout condition covers 3D boxes, HD maps, and BEV segmentation; in this paper the 3D boxes and HD maps are projected into the 2D perspective view, yielding a sequence of $N$ embeddings, where $N$ is the total number of embeddings of the projected layout and BEV segmentation instances
  • Text condition:

    • The pretrained vision-language model CLIP is used as the text encoder; a text description is composed from the view information, weather, and lighting, and encoded into the text embedding

    • CLIP: Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021

  • Action condition:

    • The action condition is one of the necessary inputs of the world model; the action at a time step is defined as the ego vehicle's displacement from its current position to its position at the next time step, and an MLP maps the action to an embedding
  • A unified condition interface:

    • All conditions are now projected into the $d$-dimensional feature space; they are concatenated and fed into the denoising UNet, giving the unified condition embedding $c^t$ for each generated frame, where the superscript $t$ indexes the $t$-th generated frame and $0$ denotes the current (context) frame;

    • Finally, $c^t$ and the latent variables interact frame by frame through cross-attention in the 3D UNet (a rough sketch of this condition interface follows the framework overview below)

    Overview of the proposed framework
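As a rough sketch of the unified condition interface (module names, dimensions, and the ConvNeXt stand-in below are assumptions for readability, not the paper's implementation):

```python
import torch
import torch.nn as nn

class UnifiedCondition(nn.Module):
    """Project heterogeneous conditions into a shared d-dimensional space and
    concatenate them along the token axis; the result is used as keys/values
    for per-frame cross-attention in the denoising UNet."""

    def __init__(self, d=768, clip_dim=512, layout_dim=256, action_dim=2):
        super().__init__()
        # image condition: stand-in for a ConvNeXt encoder whose feature map
        # is flattened into (B, n_img_tokens, d)
        self.img_proj = nn.Sequential(nn.LazyConv2d(d, kernel_size=1), nn.Flatten(2))
        self.layout_proj = nn.Linear(layout_dim, d)   # projected boxes / HD-map tokens
        self.text_proj = nn.Linear(clip_dim, d)       # CLIP text features
        self.action_proj = nn.Sequential(nn.Linear(action_dim, d), nn.SiLU(), nn.Linear(d, d))

    def forward(self, img_feat, layout_tokens, text_feat, action):
        c_img = self.img_proj(img_feat).transpose(1, 2)   # (B, n_img_tokens, d)
        c_layout = self.layout_proj(layout_tokens)        # (B, N, d)
        c_text = self.text_proj(text_feat).unsqueeze(1)   # (B, 1, d)
        c_action = self.action_proj(action).unsqueeze(1)  # (B, 1, d)
        # unified condition embedding c^t for one generated frame
        return torch.cat([c_img, c_layout, c_text, c_action], dim=1)
```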

4. World Model for End-to-End Planning

  • End-to-end planning pipeline with our world model

4.1. Tree-based Rollout with Actions

  • At each step, candidate trajectories are sampled and the world model generates the corresponding future images; an image-based reward function then selects the optimal trajectory, which is expanded into the planning tree, as shown in the figure above (a minimal sketch follows this list)
  • A pretrained planner takes the multiview images as input and samples possible trajectory candidates; the action of a trajectory at time step $t$ is defined from the ego vehicle's positions, i.e. the displacement between consecutive waypoints;
  • After the videos are generated, an image-based reward function selects the optimal trajectory as the decision result;
  • Vad: Vectorized scene representation for efficient autonomous driving
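A minimal sketch of the tree-based rollout loop described above; `planner.sample_trajectories`, `world_model.imagine`, and `reward_fn` are hypothetical interfaces used only for illustration:

```python
def plan_with_world_model(world_model, planner, reward_fn, obs, depth=2, branch=3):
    """Greedy tree expansion: at each depth, sample candidate trajectories,
    imagine their multiview futures with the world model, score the imagined
    videos with the image-based reward, and keep the best branch."""
    chosen = []
    for _ in range(depth):
        candidates = planner.sample_trajectories(obs, num=branch)
        best_traj, best_reward, best_obs = None, float("-inf"), None
        for traj in candidates:
            future_video = world_model.imagine(obs, actions=traj)  # list of multiview frames
            r = reward_fn(future_video)
            if r > best_reward:
                # the last imagined frame becomes the observation for the next depth
                best_traj, best_reward, best_obs = traj, r, future_video[-1]
        chosen.append(best_traj)
        obs = best_obs
    return chosen
```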

4.2. Image-based Reward Function

  • First, a 3D object detector and an online HD-map predictor are run on the generated videos to obtain perception results; a map reward and an object reward are then defined, following traditional rule-based planners;
  • The map reward has two components:
    • (1) distance to the curb, encouraging the vehicle to stay in the drivable area;
    • (2) centerline consistency, preventing frequent lane changes;
  • The object reward:
    • (1) distance to other vehicles (lateral and longitudinal);
  • The total reward is the product of the map reward and the object reward (a minimal sketch follows);
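A toy sketch of this reward composition on top of the perception outputs; the helper methods (`dist_to_curb`, `count_lane_changes`, `distance_to`) and the thresholds are invented for illustration:

```python
def image_based_reward(det_boxes, hd_map, ego_traj,
                       curb_margin=0.5, lane_change_penalty=0.5, safe_gap=2.0):
    """Map reward (curb distance, centerline consistency) multiplied by the
    object reward (distance to other agents). `det_boxes` and `hd_map` come
    from a 3D detector and an online HD-map predictor run on the generated video."""
    # map reward: keep away from curbs and avoid unnecessary lane changes
    curb_ok = all(hd_map.dist_to_curb(p) > curb_margin for p in ego_traj)
    lane_changes = hd_map.count_lane_changes(ego_traj)
    map_reward = (1.0 if curb_ok else 0.0) * (lane_change_penalty ** lane_changes)

    # object reward: keep lateral/longitudinal clearance from other vehicles
    min_gap = min((box.distance_to(p) for p in ego_traj for box in det_boxes),
                  default=float("inf"))
    object_reward = 1.0 if min_gap > safe_gap else min_gap / safe_gap

    return map_reward * object_reward  # total reward = map reward x object reward
```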

5. Experiments

  • Dataset

    We adopt the nuScenes [5] dataset for experiments, which is one of the most popular datasets for 3D perception and planning. It comprises 700 training videos and 150 validation videos. Each video is around 20 seconds long and is captured by six surround-view cameras.

  • Training scheme

    We crop and resize the original images from 1600 × 900 to 384 × 192. Our model is initialized with Stable Diffusion checkpoints [44]. All experiments are conducted on A40 (48GB) GPUs. For additional details, please refer to the appendix.

  • Quality evaluation

    To evaluate the quality of the generated video, we utilize FID (Fréchet Inception Distance) [24] and FVD (Fréchet Video Distance) [57] as the main metrics.
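For reference, FID can be computed with off-the-shelf tooling; the snippet below uses `torchmetrics` as an illustrative example (not necessarily the paper's evaluation code). FVD additionally requires an I3D video backbone and is usually computed with separate tooling.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID between real and generated frames; inputs are uint8 image tensors of
# shape (N, 3, H, W). Random tensors stand in for the actual frames here.
fid = FrechetInceptionDistance(feature=2048)

real_frames = (torch.rand(64, 3, 192, 384) * 255).to(torch.uint8)
fake_frames = (torch.rand(64, 3, 192, 384) * 255).to(torch.uint8)

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print("FID:", fid.compute().item())
```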

  • Multiview consistency evaluation

  • Controllability evaluation

  • Planning evaluation

  • Model variants

    We support action-based video generation and layout-based video generation. The former gives the ego action of each frame as the condition, while the latter gives the layout (3D boxes and map information) of each frame.

  • Multi-view video generation performance on nuScenes
  • Ablations of the components in model design
  • Exploring Planning with World Model
  • Qualitative results of factorized multiview generation and counterfactual event generation
  • Out-of-domain planning

Summary

A solid piece of work on using a world model for end-to-end autonomous driving; the action-planning part is relatively coarse, but the rich experimental results and the overall framework are inspiring.