Only by predicting the future and avoiding risks in advance can a vehicle drive safely and efficiently. This paper proposes a method that uses a world model to generate multiview video data and leverages it for the planning model.
Project Page: https://drive-wm.github.io
Code: https://github.com/BraveGroup/Drive-WM
1. Introduction
Out-of-distribution (OOD) cases: when the vehicle is not on the lane centerline, the planner produces abnormal results because such cases were never seen during training.
To address this problem, a world model that predicts the outcomes of planning decisions is proposed, so that potential risks can be foreseen in advance.
World models:
- Recurrent world models facilitate policy evolution. NeurIPS, 31, 2018.
- Dream to control: Learning behaviors by latent imagination. In ICLR, 2020.
- A path towards autonomous machine intelligence. 2022
Three main challenges:
- The world model needs to operate in a high-resolution pixel space; otherwise it cannot effectively represent many fine-grained and non-vectorizable events. Vector-space world models require extra vector annotations and are affected by the state-estimation noise of the perception model.
- Generating multiview-consistent video data is difficult; prior work mainly covers single-view video and multiview image generation.
- Adapting to various heterogeneous conditions is difficult, e.g., weather, lighting, ego actions, and agent layouts.
Main contributions:
- Proposes Drive-WM, a multiview world model that generates high-quality, consistent multiview videos of autonomous-driving scenes;
- Experiments on the nuScenes dataset demonstrate high generation quality and strong controllability;
- Pioneers the exploration of combining a world model with end-to-end planning;
2. Related Works
2.1. Video Generation and Prediction
Video generation:
VAE-based:
- Stochastic latent residual video prediction. In ICML, pages 3233–3246. PMLR, 2020
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- High fidelity video prediction with large stochastic recurrent neural networks. NeurIPS, 32, 2019
- Predicting video with vqvae. arXiv preprint arXiv:2103.01950, 2021
GAN-based
- Generating long videos of dynamic scenes. NeurIPS, 35:31769–31781, 2022.
- Stylevideogan: A temporal generative model using a pretrained stylegan. arXiv preprint arXiv:2107.07224, 2021
- Drivegan: Towards a controllable high-quality neural simulation. In CVPR, pages 5820–5829, 2021.
- Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In CVPR, pages 3626–3636, 2022.
- Mocogan: Decomposing motion and content for video generation. In CVPR, pages 1526–1535, 2018.
flow-based
- Stochastic image-to-video synthesis using cinns. In CVPR, pages 3742–3753, 2021
- Videoflow: A conditional flow-based model for stochastic video generation. In ICLR, 2020.
auto-regressive models
- Long video generation with time-agnostic vqgan and time-sensitive transformer. In ECCV, pages 102–118. Springer, 2022
- Scaling autoregressive video models. In ICLR, 2020.
- Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
diffusion-based
- Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pages 16784–16804, 2022
- High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
- Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023.
- Flexible diffusion modeling of long videos. NeurIPS, 35:27953–27965, 2022
- Diffusion models for video prediction and infilling. TMLR, 2022
Video prediction:
- DriveGAN:Drivegan: Towards a controllable high-quality neural simulation. In CVPR, pages 5820–5829, 2021.
- GAIA-1:Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023
- DriveDreamer:Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
2.2. World Model for Planning
- Dreamer
learns a dynamics model from past experience, then predicts state values and actions in the latent space;
- Dream to control: Learning behaviors by latent imagination. In ICLR, 2020
- DreamerV2 further reaches human-level performance on Atari;
- Mastering atari with discrete world models. In ICLR, 2021.
- DreamerV3 further tackles the long-horizon survival challenge in the sparse-reward Minecraft environment;
- Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
- DayDreamer applies Dreamer to train robots online for locomotion and manipulation tasks without changing its hyperparameters;
- World models for physical robot learning. In CoRL, pages 2226–2240. PMLR, 2023.
- MILE: autonomous driving in CARLA via model-based imitation learning
- Model-based imitation learning for urban driving. NeurIPS, 35:20703–20716, 2022.
3. Multi-view Video Generation
3.1. Joint Modeling of Multiview Video
Formulation
The dataset consists of multiview videos, each containing images from $K$ views over $T$ time steps; a video is first encoded into a latent representation $x$. The diffusion input is $x_\tau = \alpha_\tau x + \sigma_\tau \epsilon$, where $\alpha_\tau$ and $\sigma_\tau$ are the noise-schedule parameters at diffusion time $\tau$. The denoising model $f_{\theta,\phi,\psi}$ (parameterized by spatial parameters $\theta$, temporal parameters $\phi$, and multiview parameters $\psi$) receives the diffused $x_\tau$ and minimizes the denoising score matching objective
$$\mathbb{E}_{x,\,\epsilon,\,\tau}\big[\,\lVert y - f_{\theta,\phi,\psi}(x_\tau;\, c, \tau) \rVert_2^2\,\big],$$
where $c$ is the condition, $y$ is the target (e.g., the injected noise $\epsilon$), $\epsilon \sim \mathcal{N}(0, I)$ is random noise, and $\tau$ is sampled from a uniform distribution over the diffusion time.
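A minimal PyTorch sketch of this training objective, assuming ε-prediction, a discrete noise schedule stored in `alphas`/`sigmas`, and a `denoiser(x_tau, cond, tau)` callable standing in for $f_{\theta,\phi,\psi}$; all names and shapes are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def multiview_video_diffusion_loss(denoiser, latents, cond, alphas, sigmas):
    """One denoising-score-matching step for joint multiview video latents.

    latents: (B, T, K, C, H, W) -- T frames from K camera views (VAE latents).
    cond:    unified condition embeddings (see Sec. 3.3).
    alphas, sigmas: 1-D noise-schedule tensors indexed by the diffusion step.
    """
    b = latents.shape[0]
    # Sample a diffusion time step uniformly for each sample in the batch.
    tau = torch.randint(0, alphas.shape[0], (b,), device=latents.device)
    eps = torch.randn_like(latents)
    a = alphas[tau].view(b, 1, 1, 1, 1, 1)
    s = sigmas[tau].view(b, 1, 1, 1, 1, 1)
    # Diffused input x_tau = alpha_tau * x + sigma_tau * eps.
    x_tau = a * latents + s * eps
    # Epsilon-prediction target (one common choice for the target y).
    pred = denoiser(x_tau, cond, tau)
    return F.mse_loss(pred, eps)
```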
Temporal encoding layers
Multiview encoding layers
Multiview temporal tuning
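As a rough illustration of these layers, one plausible realization is self-attention along the time axis (temporal encoding) and along the view axis (multiview encoding) of the latent tensor; the sketch below is an assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Self-attention along one axis (time or view) of a multiview video latent."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, axis):
        # x: (B, T, K, HW, C); axis=1 attends over time, axis=2 over views.
        moved = x.movedim(axis, -2)                       # bring the chosen axis next to C
        flat = moved.reshape(-1, moved.shape[-2], moved.shape[-1])
        h = self.norm(flat)
        out, _ = self.attn(h, h, h)
        out = out.reshape(moved.shape).movedim(-2, axis)
        return x + out                                    # residual connection

# Usage inside a UNet block (sketch): temporal then multiview attention.
#   z = axis_attn(z, axis=1)   # temporal encoding layer
#   z = axis_attn(z, axis=2)   # multiview encoding layer
```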
3.2. Factorization of Joint Multiview Modeling
- To ensure multiview consistency, a distribution factorization is introduced.
Formulation
Let $x^i$ denote the sample of the $i$-th view. Section 3.1 already models the joint distribution over all views, which can be factorized autoregressively as $p(x^{1:K}) = \prod_{i=1}^{K} p(x^i \mid x^{1:i-1})$, so each newly generated view depends on the views already generated. Such conditional distributions yield better cross-view consistency, but the autoregressive generation is inefficient, making this full factorization impractical in practice. The views are therefore split into two groups: (1) reference views $x_r$ and (2) stitched views $x_s$. For example, in nuScenes the reference views can be {F, BL, BR} (F: front, B: back, L: left, R: right) and the stitched views {FL, B, FR}. (The term "stitched" is used because a stitched view appears to be "stitched" together from its two neighboring reference views.) The reference views are generated jointly first, and each stitched view is then generated conditioned on its neighboring reference views.
View factorization scheme:
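A minimal sketch of the resulting two-stage generation under this grouping; `sample` stands in for running the conditional diffusion sampler for a set of views, and the function signature and neighbor mapping are illustrative assumptions.

```python
# Stage 1 samples the reference views jointly; stage 2 samples each stitched view
# conditioned on its two neighboring reference views (nuScenes camera layout assumed).
REFERENCE_VIEWS = ["F", "BL", "BR"]
STITCHED_VIEWS = {          # stitched view -> its two neighboring reference views
    "FL": ("F", "BL"),
    "B":  ("BL", "BR"),
    "FR": ("F", "BR"),
}

def generate_multiview(sample, cond):
    # Stage 1: jointly sample the reference views.
    videos = sample(views=REFERENCE_VIEWS, cond=cond)
    # Stage 2: sample each stitched view conditioned on its neighbors,
    # which enforces cross-view consistency without full autoregression.
    for view, (left, right) in STITCHED_VIEWS.items():
        videos[view] = sample(
            views=[view],
            cond=cond,
            reference={left: videos[left], right: videos[right]},
        )[view]
    return videos
```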
3.3. Unified Conditional Generation
Conditions used:
- initial context frames,
- text descriptions,
- ego actions,
- 3D boxes,
- BEV maps,
- reference views
Image condition:
- A conditioning image is encoded and flattened into a sequence of d-dimensional embeddings; ConvNeXt is used as the image encoder, and embeddings from different condition images can be concatenated along the first (sequence) dimension.
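A minimal sketch of such an image-condition encoder, assuming a torchvision ConvNeXt-Tiny backbone and d = 768; the projection layer and token layout are assumptions for illustration.

```python
import torch
import torchvision

class ImageConditionEncoder(torch.nn.Module):
    """Encode a conditioning image into a sequence of d-dimensional embeddings."""

    def __init__(self, d=768):
        super().__init__()
        backbone = torchvision.models.convnext_tiny(weights="DEFAULT")
        self.features = backbone.features           # ConvNeXt feature extractor
        self.proj = torch.nn.Linear(768, d)          # project to the shared d-dim space

    def forward(self, img):                          # img: (B, 3, H, W)
        f = self.features(img)                       # (B, 768, h, w) feature map
        tokens = f.flatten(2).transpose(1, 2)        # (B, h*w, 768) -- flatten spatially
        return self.proj(tokens)                     # (B, h*w, d) condition tokens
```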
Layout condition:
- The layout condition refers to 3D boxes, HD maps, and BEV segmentation. In this paper, the 3D boxes and HD maps are projected into the 2D image view to obtain a sequence of embeddings, whose length is the total number of embeddings of the projected layout and BEV-segmentation instances.
Text condition:
The pretrained CLIP model is used as the text encoder; text descriptions are composed from view information, weather, and lighting, and encoded into text embeddings.
CLIP: Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021
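A minimal sketch of the text condition, assuming the CLIP text encoder commonly paired with Stable Diffusion (`openai/clip-vit-large-patch14`); the prompt template combining view, weather, and lighting is illustrative.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Illustrative description built from view / weather / lighting information.
prompt = "A front-camera driving scene. Rain, night."
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")
with torch.no_grad():
    c_text = text_encoder(**tokens).last_hidden_state   # (1, 77, 768) text condition tokens
```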
Action condition:
- The action condition is one of the essential inputs of a world model. The action at a time step is defined as the displacement of the ego vehicle from its current position to its position at the next time step; an MLP maps the action to an embedding.
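A minimal sketch of the action embedding, assuming a 2-D per-step displacement and a small two-layer MLP; the hidden sizes are assumptions.

```python
import torch.nn as nn

class ActionEmbedding(nn.Module):
    """Map a per-step ego action (planar displacement to the next step) to a d-dim token."""

    def __init__(self, d=768, action_dim=2):   # action_dim=2 for (dx, dy) is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(action_dim, d), nn.SiLU(), nn.Linear(d, d)
        )

    def forward(self, action):                  # action: (B, 2)
        return self.mlp(action).unsqueeze(1)    # (B, 1, d) -- one condition token
```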
A unified condition interface:
All conditions are now mapped into the same d-dimensional feature space. They are concatenated to form the unified condition embedding for frame t, which is fed into the denoising UNet; here t indexes the generated frame, and 0 denotes the current frame. Finally, the unified condition embeddings and the latent features interact frame by frame via cross attention in the 3D UNet.
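A minimal sketch of the unified interface, assuming each condition has already been encoded into (B, n, d) token sequences by the encoders above; the concatenated sequence then serves as the key/value of the cross-attention layers.

```python
import torch

def unified_condition(c_image, c_layout, c_text, c_action):
    """Concatenate per-frame condition embeddings along the token dimension.

    Each input: (B, n_*, d).  Output: (B, n_image + n_layout + n_text + n_action, d),
    used as the key/value sequence for cross attention in the 3D UNet.
    """
    return torch.cat([c_image, c_layout, c_text, c_action], dim=1)

# In each UNet block (sketch), latent tokens attend to the unified condition c_t:
#   h = h + cross_attn(query=h_tokens, key=c_t, value=c_t)
```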
4. World Model for End-to-End Planning
4.1. Tree-based Rollout with Actions
- At each step, candidate trajectories are sampled and the world model generates the corresponding future images; an image-based reward function then selects the optimal trajectory, which is expanded into the planning tree (as illustrated in the paper's planning-tree figure).
- A pretrained planner takes the multiview images as input and samples possible trajectory candidates; the action of each trajectory at time step $t$ is defined as $(x_{t+1}-x_t,\ y_{t+1}-y_t)$, where $(x_t, y_t)$ is the ego position.
- After the future videos are generated, an image-based reward function selects the optimal trajectory as the decision result.
- Vad: Vectorized scene representation for efficient autonomous driving
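A minimal sketch of the tree-based rollout, where `planner`, `world_model`, and `reward` are placeholders for the pretrained planner (e.g., VAD), the multiview video world model, and the image-based reward of Sec. 4.2; the interface is an assumption.

```python
def plan_with_world_model(obs, planner, world_model, reward, depth=2, num_candidates=3):
    """Expand a planning tree by imagining futures and keeping the best branch."""
    best_traj = []
    for _ in range(depth):
        # Sample candidate trajectories from the pretrained planner.
        candidates = planner.sample_trajectories(obs, k=num_candidates)
        # Imagine the future multiview video for each candidate action sequence.
        rollouts = [world_model.generate(obs, actions=traj) for traj in candidates]
        scores = [reward(video) for video in rollouts]
        best = max(range(len(candidates)), key=lambda i: scores[i])
        best_traj.append(candidates[best])
        obs = rollouts[best]      # expand the tree from the selected branch
    return best_traj
```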
4.2. Image-based Reward Function
- First, a 3D object detector and an online HD-map predictor are applied to the generated videos to obtain perception results; a map reward and an object reward are then defined (following traditional rule-based planners);
- The map reward includes two components:
- (1) the distance to the road curb, encouraging the vehicle to drive in the correct area;
- (2) centerline consistency, preventing frequent lane changes;
- Object reward:
- (1) the distance (lateral and longitudinal) to other vehicles;
- The total reward is the product of the map reward and the object reward.
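A minimal sketch of this reward, assuming the perception results (curb/centerline distances, distances to other agents) have already been extracted from the generated videos; the thresholds and functional forms are assumptions, not the paper's exact values.

```python
def map_reward(dist_to_curb, dist_to_centerline, curb_margin=1.0, center_margin=0.5):
    # Reward staying away from the curb and close to a lane centerline.
    keep_on_road = 1.0 if dist_to_curb > curb_margin else 0.1
    keep_lane = 1.0 if dist_to_centerline < center_margin else 0.5
    return keep_on_road * keep_lane

def object_reward(lat_dist, lon_dist, lat_margin=1.0, lon_margin=3.0):
    # Reward keeping lateral and longitudinal clearance to other vehicles.
    return 1.0 if (lat_dist > lat_margin and lon_dist > lon_margin) else 0.1

def total_reward(dist_to_curb, dist_to_centerline, lat_dist, lon_dist):
    # Total reward = map reward x object reward (Sec. 4.2).
    return map_reward(dist_to_curb, dist_to_centerline) * object_reward(lat_dist, lon_dist)
```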
5. Experiments
Dataset
We adopt the nuScenes [5] dataset for experiments, which is one of the most popular datasets for 3D perception and planning. It comprises 700 training videos and 150 validation videos; each video spans around 20 seconds and is captured by six surround-view cameras.
Training scheme
We crop and resize the original image from 1600 × 900 to 384 × 192. Our model is initialized with Stable Diffusion checkpoints [44]. All experiments are conducted on A40 (48GB) GPUs. For additional details, please refer to the appendix.
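One plausible preprocessing pipeline matching the stated 1600 × 900 → 384 × 192 setting, assuming a center crop to the 2:1 target aspect ratio before resizing (the exact crop region is not specified here, so this is an assumption).

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.CenterCrop((800, 1600)),   # (height, width): crop 1600x900 to a 2:1 region
    transforms.Resize((192, 384)),        # resize to 384x192 (width x height)
    transforms.ToTensor(),
])
```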
Quality evaluation
To evaluate the quality of the generated video, we utilize FID (Fréchet Inception Distance) [24] and FVD (Fréchet Video Distance) [57] as the main metrics.
Multiview consistency evaluation
Controllability evaluation
Planning evaluation
Model variants
We support action-based video generation and layout-based video generation. The former gives the ego action of each frame as the condition, while the latter gives the layout (3D boxes, map information) of each frame.
Summary
A fairly solid piece of work on using a world model for end-to-end autonomous driving; the behavior-planning part is relatively coarse, but the rich experimental results and the overall framework are inspiring.