SEED4D

1Continental AG, 2University of Freiburg, 3Australian National University,
4University of Oxford, 5University of Tübingen

Abstract

SEED4D is a large-scale synthetic multi-view dynamic 4D driving dataset, data generator, and benchmark. Models for egocentric 3D and 4D reconstruction, including few-shot interpolation and extrapolation settings, can benefit from having images from exocentric viewpoints as supervision signals. No existing dataset provides the necessary mixture of complex, dynamic, and multi-view data. To facilitate the development of 3D and 4D reconstruction methods in the autonomous driving context, we propose the Synthetic Ego--Exo Dynamic 4D (SEED4D) dataset. SEED4D comprises two large-scale multi-view synthetic urban scene datasets. Our static (3D) dataset contains 212K inward- and outward-facing vehicle images from 2K scenes, while our dynamic (4D) dataset contains 16.8M images from 10K trajectories, each sampled at 100 points in time with egocentric images, exocentric images, and LiDAR data. We additionally present a customizable, easy-to-use data generator for spatio-temporal multi-view data creation. Our open-source data generator allows the creation of synthetic data for camera setups commonly used in the NuScenes, KITTI360, and Waymo datasets.


Datasets

Static Ego-Exo Dataset. We introduce a novel dataset for few-view image reconstruction tasks in an autonomous driving setting. Our dataset contains 2002 single-timestep complex outdoor driving scenes, each offering six plus one outward-facing vehicle images and 100 images from exocentric viewpoints on a bounding sphere for supervision. Only a single vehicle in each scene is equipped with this setup. The ego views are 900 × 1600 to match the NuScenes dataset, and the exocentric surround-vehicle views are 600 × 800.

Dynamic Ego-Exo Dataset. Our temporal dataset consists of 10.5K driving trajectories well-suited for 4D forecasting, 4D reconstruction, or video prediction tasks. Each trajectory is 100 steps long, corresponding to 10 seconds of driving. The 10.5K trajectories come from a total of 498 scenes across all towns. In each scene, 21 vehicles are each equipped with six plus one outward-facing vehicle cameras and ten inward-facing surround-vehicle exocentric cameras. The ego views are 128 × 256 and the exo views are 98 × 128.

The static and the dynamic ego--exo view datasets are visualized in the figure below. They differ mainly in image resolution and trajectory length and have complementary strengths. The static ego--exo dataset contains 12K egocentric views and 200K exocentric views. The dynamic ego--exo dataset contains 6.3M egocentric views and 10.5M exocentric views.

[Figure: the static and dynamic ego--exo view datasets.]
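To make the layout concrete, below is a minimal Python loading sketch for a static scene; the folder names (`ego/`, `exo/`) and the scene path are placeholders for illustration, not the dataset's actual structure.

```python
from pathlib import Path

import imageio.v3 as iio


def load_static_scene(scene_dir: str):
    """Load the egocentric and exocentric RGB views of one static scene.

    The folder names `ego/` and `exo/` are hypothetical placeholders;
    consult the released dataset for the actual directory layout.
    """
    scene = Path(scene_dir)
    # 6+1 outward-facing egocentric views (900 x 1600).
    ego = [iio.imread(p) for p in sorted((scene / "ego").glob("*.png"))]
    # 100 exocentric supervision views on the bounding sphere (600 x 800).
    exo = [iio.imread(p) for p in sorted((scene / "exo").glob("*.png"))]
    return ego, exo


ego_views, exo_views = load_static_scene("SEED4D/static/scene_0001")
print(len(ego_views), len(exo_views))  # expected: 7 and 100
```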

Benchmark Results

Multi-view Novel View Synthesis. We evaluate how well existing methods can reconstruct the scene given many spherical views. We divide the 100 spherical views into training and test data using an 80/20 split.
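A minimal sketch of this protocol, assuming the 100 spherical views of a scene are available as image arrays; the fixed seed and helper name are illustrative rather than the benchmark's official split code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Deterministic 80/20 split of the 100 spherical views (seed is illustrative).
rng = np.random.default_rng(0)
indices = rng.permutation(100)
train_idx, test_idx = indices[:80], indices[80:]


def evaluate_views(renders, targets):
    """Average PSNR/SSIM over held-out views given as uint8 (H, W, 3) arrays."""
    psnr = np.mean([peak_signal_noise_ratio(t, r) for r, t in zip(renders, targets)])
    ssim = np.mean([structural_similarity(t, r, channel_axis=-1) for r, t in zip(renders, targets)])
    return float(psnr), float(ssim)
```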

Monocular Metric Depth Estimation. Since our dataset contains ground-truth depth maps, we evaluated two recent monocular metric depth estimation methods. Both methods were tested without further fine-tuning on our data.
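A small sketch of how the reported depth error could be computed; the validity mask and 80 m cutoff are assumptions, not the benchmark's official settings.

```python
import numpy as np


def depth_rmse(pred: np.ndarray, gt: np.ndarray, max_depth: float = 80.0) -> float:
    """RMSE in metres between predicted and ground-truth depth maps.

    Invalid pixels and everything beyond `max_depth` (e.g. sky) are masked out;
    the 80 m cutoff is an assumption, not the benchmark's official setting.
    """
    mask = (gt > 0) & (gt < max_depth)
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))
```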

Single-shot Few-Image Scene Reconstruction. For few-image-to-3D reconstruction, we deviate from many existing comparisons by targeting an automotive use case: we evaluate methods on egocentric outward-facing input views and supervise the resulting novel views with 360° exocentric spherical views.
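Perceptual quality of the rendered exocentric views can be measured, for instance, with the lpips package, as in the generic sketch below; this is not the paper's exact evaluation code.

```python
import torch
import lpips  # pip install lpips

# LPIPS perceptual distance (lower is better) between a rendered exocentric
# view and its ground-truth supervision image. Inputs are float tensors of
# shape (1, 3, H, W) scaled to [-1, 1].
loss_fn = lpips.LPIPS(net="alex")


def lpips_score(render: torch.Tensor, target: torch.Tensor) -> float:
    with torch.no_grad():
        return loss_fn(render, target).item()
```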

| Evaluation | Method | PSNR | SSIM | LPIPS | RMSE (m) |
|---|---|---|---|---|---|
| Multi-view | SplatFacto [3,4] | 24.458 | 0.806 | 0.210 | - |
| Multi-view | NeRFacto [5] | 24.936 | 0.804 | 0.227 | - |
| Multi-view | K-Planes [1,2] | 25.744 | 0.816 | 0.239 | - |
| Monocular Depth | ZoeDepth [7] | - | - | - | 12.352 |
| Monocular Depth | Metric3D [6] | - | - | - | 7.668 |
| Few-view | K-Planes [1,2] | 11.356 | 0.463 | 0.633 | - |
| Few-view | SplatFacto [3,4] | 11.607 | 0.486 | 0.658 | - |
| Few-view | NeRFacto [5] | 10.943 | 0.298 | 0.791 | - |
| Few-view | PixelNeRF [8] | 14.500 | 0.550 | 0.652 | 19.235 |
| Few-view | SplatterImage [9] | 17.791 | 0.580 | 0.568 | 11.049 |
| Few-view | 6Img-to-3D [10] | 18.682 | 0.726 | 0.451 | 6.232 |

We welcome submissions. Please reach out to us via email!

Data Generator

Both datasets presented in this paper were generated with our data generator. The data generator provides an easy-to-use front-end for the CARLA simulator. With it, one can easily define parameters such as the town, the vehicle's initial position, the weather, the number of traffic participants, and the number, type, and position of sensors (both ego and exo views).
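A hypothetical configuration sketch follows; the parameter names and structure are assumptions for illustration, and the real interface is documented with the data generator.

```python
# Hypothetical configuration sketch -- parameter names are illustrative only;
# see the SEED4D data generator for the real interface.
config = {
    "town": "Town02",            # CARLA map
    "spawn_point": 42,           # index of the vehicle's initial position
    "weather": "ClearNoon",      # CARLA weather preset
    "num_vehicles": 21,          # number of traffic participants
    "sensors": {
        "ego": ["rgb", "depth", "semantic_segmentation", "lidar"],  # outward-facing rig (e.g. NuScenes-style)
        "exo": {"type": "rgb", "count": 100, "layout": "sphere"},   # supervision views on a bounding sphere
    },
}
```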

Several different sensor readings are available within our dataset:

The dataset provides RGB images, depth maps, and semantic and instance segmentation for each view. Furthermore, the 3D bounding boxes of each vehicle in the scene are also available.
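If the depth maps are stored in CARLA's standard 24-bit RGB encoding (an assumption; the released data may already ship decoded depth), they can be converted to metres as in this sketch:

```python
import numpy as np


def carla_depth_to_meters(depth_rgb: np.ndarray) -> np.ndarray:
    """Decode CARLA's 24-bit RGB depth encoding into metres.

    Assumes the raw CARLA format, where depth is packed into the R, G, and B
    channels and normalised to a 1000 m far plane; if the dataset already
    provides decoded depth maps, this step is unnecessary.
    """
    r = depth_rgb[..., 0].astype(np.float64)
    g = depth_rgb[..., 1].astype(np.float64)
    b = depth_rgb[..., 2].astype(np.float64)
    normalized = (r + g * 256.0 + b * 256.0 ** 2) / (256.0 ** 3 - 1)
    return normalized * 1000.0
```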

Towns

The generated data come from Towns 1 to 7 and 10HD. Towns 1, 3 to 7, and 10HD are used for training, and all scenes from Town 2 are held out for evaluation.

| Town | Number of Scenes | Split | Description |
|---|---|---|---|
| Town01 | 255 | Train | A small, simple town with a river and several bridges. |
| Town03 | 265 | Train | A larger, urban map with a roundabout and large junctions. |
| Town04 | 372 | Train | A small town embedded in the mountains with a special "figure of 8" infinite highway. |
| Town05 | 302 | Train | A squared-grid town with cross junctions and a bridge. It has multiple lanes per direction, useful for performing lane changes. |
| Town06 | 436 | Train | Long many-lane highways with many highway entrances and exits. It also has a Michigan left. |
| Town07 | 116 | Train | A rural environment with narrow roads, corn, barns, and hardly any traffic lights. |
| Town10HD | 155 | Train | A downtown urban environment with skyscrapers, residential buildings, and an ocean promenade. |
| Town02 | 101 | Val | A small, simple town with a mixture of residential and commercial buildings. |
[Figure: example egocentric input images and exocentric supervision images for Towns 1 to 7.]

BibTeX

@misc{kaestingschaefer2024seed4dsyntheticegoexodynamic,
      title={SEED4D: A Synthetic Ego--Exo Dynamic 4D Data Generator, Driving Dataset and Benchmark}, 
      author={Marius Kästingschäfer and Théo Gieruc and Sebastian Bernhard and Dylan Campbell and Eldar Insafutdinov and Eyvaz Najafli and Thomas Brox},
      year={2024},
      eprint={2412.00730},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.00730}, 
}

Acknowledgement

We thank Nerfies for the website template.