VIPS: Vehicle-Infrastructure Cooperative Planning Benchmark via Pseudo-Simulation

Cho, Hoonhee; Kang, Jae-Young; Lee, Giwon; Yang, Hyemin; Park, Heejun; Yoon, Kuk-Jin

Hoonhee Cho^*, Jae-Young Kang^*, Giwon Lee^*, Hyemin Yang^*, Heejun Park^*, Kuk-Jin Yoon^†

KAIST
ECCV 2026
^*Indicates Equal Contribution ^†Corresponding Author

Paper (Coming Soon) Code Dataset

Code & Dataset are now available

Two-stage V2X pseudo-simulation overview

Two-stage V2X pseudo-simulation. (a) Stage 1 evaluates the AV trajectory using real-world observations. (b, c) Stage 2 probes the policy around the expert endpoint by sampling predefined starting points and generating synthetic observations. The synthesized views consistently render the AV at novel positions across both vehicle and infrastructure images. Stage 2 scores are aggregated with weights that assign higher importance to samples whose starting points are closer to the Stage 1 endpoint.
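The distance-based weighting of Stage 2 scores described in the caption can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the Gaussian kernel, the `sigma` parameter, and the function names `stage2_weights` and `aggregate_stage2` are assumptions, since the page only states that closer starting points receive higher importance.

```python
import math

def stage2_weights(start_points, stage1_endpoint, sigma=1.0):
    """Return normalized weights that decay with distance to the Stage 1 endpoint.

    The Gaussian kernel and `sigma` are assumed for illustration; the page
    only says that closer starting points receive higher importance.
    """
    ex, ey = stage1_endpoint
    raw = []
    for x, y in start_points:
        squared_dist = (x - ex) ** 2 + (y - ey) ** 2
        raw.append(math.exp(-squared_dist / (2.0 * sigma ** 2)))
    total = sum(raw)
    return [w / total for w in raw]

def aggregate_stage2(scores, weights):
    """Weighted mean of per-sample Stage 2 scores (weights must sum to 1)."""
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical starting points (metres) and per-sample scores in [0, 1].
starts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
weights = stage2_weights(starts, stage1_endpoint=(0.0, 0.0))
stage2_score = aggregate_stage2([0.9, 0.8, 0.2], weights)
```

With this choice, a sample that starts far from where the policy actually ended Stage 1 contributes little to the aggregated score, so the result reflects the states the policy is most likely to reach.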

Abstract

End-to-end autonomous driving (E2E-AD) in urban environments requires robust decision-making under partial observability and complex multi-agent interactions. Severe occlusions and dense traffic at intersections limit the perception capability of single-agent systems, motivating recent efforts on Vehicle-to-Infrastructure (V2I) cooperation for perception and planning. However, existing evaluation protocols face a fundamental trade-off: open-loop evaluation fails to capture error accumulation and recovery from deviations, while closed-loop evaluation is costly, difficult to scale, and often relies on simulated environments that may suffer from domain gaps.

To bridge this gap, we propose VIPS, a benchmark for cooperative autonomous driving in V2I settings based on pseudo-simulation. VIPS extends pseudo-simulation to multi-agent scenarios by integrating vehicle and infrastructure observations and introducing interaction-aware perturbations that approximate realistic traffic dynamics. This enables scalable yet realistic evaluation of robustness and error propagation without full simulation. We further present CoS-V2X, a cooperative planning framework based on sparse representations. CoS-V2X models vehicle–infrastructure interactions using compact features for efficient communication and robust decision-making under heterogeneous observations. Code and dataset will be publicly available.

Contributions

VIPS benchmark. We extend the two-stage pseudo-simulation paradigm to the Vehicle-to-Infrastructure (V2I) setting, integrating paired vehicle-side and infrastructure-side observations with interaction-aware perturbations, enabling scalable yet realistic evaluation of robustness and error propagation directly on real-world data.
Map and novel-view annotations. We augment the V2X-Real dataset with manually annotated vector maps and a paired novel-view generation pipeline for both vehicle (3D Gaussian Splatting + diffusion refinement) and infrastructure (masking + inpainting) observations, supporting geometry-aware, planning-oriented evaluation.
CoS-V2X framework. We propose a simple yet effective sparse, anchor-based cooperative planning framework that exchanges only a small set of informative instances, achieving the best overall performance while requiring lower communication bandwidth than dense BEV-based cooperative methods.
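To make the bandwidth argument concrete, the arithmetic below compares the payload of a dense BEV feature map with that of a small set of instance features. Every number here (grid size, channel count, instance count, 4-byte values) is a hypothetical placeholder chosen for illustration, not a figure reported for CoS-V2X or any baseline.

```python
def payload_bytes(num_elements, channels, bytes_per_value=4):
    """Size of a feature tensor with `num_elements` feature vectors of width `channels`."""
    return num_elements * channels * bytes_per_value

# Hypothetical dense BEV map: a 200 x 200 grid with 256 channels.
dense_bytes = payload_bytes(200 * 200, 256)

# Hypothetical sparse message: 50 instance features with 256 channels.
sparse_bytes = payload_bytes(50, 256)

# How many times smaller the sparse message is under these assumptions.
reduction = dense_bytes / sparse_bytes
```

The payload of a dense map grows with the grid area, while a sparse message grows only with the number of transmitted instances.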

CoS-V2X Framework

Overview of the CoS-V2X cooperative planning framework

Overview of CoS-V2X. Left: the overall framework performs cooperative perception (3D detection and online mapping) through interaction between the infrastructure and the ego vehicle, followed by sequential motion prediction and planning. Right: the cooperative perception module uses a symmetric anchor-based design across agents. Only the top-K high-confidence infrastructure instances are transmitted, exchanged via bidirectional cross-attention, and merged with confidence-weighted fusion. This sparse representation reduces communication overhead while preserving robust cross-agent reasoning. (DA: Deformable Aggregation; FF: Feedforward Network; CA: Cross-Attention; SA: Self-Attention; Pred: regression and classification heads.)
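The top-K selection and confidence-weighted fusion steps from the caption can be sketched in a few lines. This is a simplified illustration under stated assumptions: instances are plain dictionaries with hypothetical `conf` and `feat` keys, the pairing of vehicle and infrastructure instances is taken as given, and the bidirectional cross-attention step is omitted. The actual fusion rule used by CoS-V2X is not specified on this page.

```python
def select_top_k(instances, k):
    """Keep the k instances with the highest confidence, in descending order."""
    return sorted(instances, key=lambda inst: inst["conf"], reverse=True)[:k]

def fuse(vehicle_inst, infra_inst):
    """Confidence-weighted average of two matched instances' feature vectors.

    Assumes both feature vectors have the same length and describe the same object.
    """
    cv, ci = vehicle_inst["conf"], infra_inst["conf"]
    total = cv + ci
    if total == 0:
        # No confidence on either side: fall back to the vehicle feature.
        return list(vehicle_inst["feat"])
    return [
        (cv * a + ci * b) / total
        for a, b in zip(vehicle_inst["feat"], infra_inst["feat"])
    ]

# Hypothetical infrastructure instances; only the top 2 would be transmitted.
infra = [
    {"conf": 0.9, "feat": [1.0, 0.0]},
    {"conf": 0.2, "feat": [0.0, 1.0]},
    {"conf": 0.6, "feat": [0.5, 0.5]},
]
sent = select_top_k(infra, k=2)
fused = fuse({"conf": 0.25, "feat": [0.0, 4.0]}, {"conf": 0.75, "feat": [4.0, 0.0]})
```

The more confident agent dominates the fused feature, which is useful when one side sees an object that is occluded or distant for the other.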

Benchmark Construction

We accumulate LiDAR point clouds using ego poses and manually annotate vector maps for V2X-Real, enabling geometry-aware planning evaluation.
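Accumulating LiDAR sweeps with ego poses amounts to transforming each sweep into a common world frame and concatenating the results. The sketch below assumes each pose is a 4x4 row-major homogeneous ego-to-world matrix; the function names and the data layout are illustrative and do not reflect the V2X-Real file format. It is kept dependency-free for clarity, whereas a real pipeline would typically vectorize this step.

```python
def transform_points(points, pose):
    """Apply a 4x4 homogeneous transform (ego -> world) to a list of (x, y, z) points."""
    world = []
    for x, y, z in points:
        world.append(tuple(
            pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
            for r in range(3)
        ))
    return world

def accumulate(frames):
    """Merge per-frame point clouds, given as (points, pose) pairs, into one world-frame cloud."""
    merged = []
    for points, pose in frames:
        merged.extend(transform_points(points, pose))
    return merged

# Two hypothetical frames: the ego vehicle moves 10 m along x between them.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
moved = [[1, 0, 0, 10], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
cloud = accumulate([([(1.0, 0.0, 0.0)], identity), ([(1.0, 0.0, 0.0)], moved)])
```

The accumulated cloud is denser than any single sweep, which makes lane boundaries and other map elements easier to annotate.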

Given a virtual relocation of the ego vehicle, we synthesize paired novel views: vehicle views via 3D Gaussian Splatting with a render enhancer, and infrastructure views via masking and temporal inpainting.
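A virtual relocation can be expressed as an offset applied in the ego vehicle's own frame. The sketch below is one simple way to compute the relocated pose in 2D, assuming poses are `(x, y, yaw)` tuples with x forward and y to the left. The function name `relocate` and the offset parameterization are hypothetical, and the rendering steps themselves are not shown.

```python
import math

def relocate(pose, longitudinal, lateral, dyaw=0.0):
    """Shift an (x, y, yaw) pose by offsets given in the ego frame.

    `longitudinal` moves along the heading, `lateral` moves to the left of it
    (assumed convention), and `dyaw` rotates the heading in radians.
    """
    x, y, yaw = pose
    new_x = x + longitudinal * math.cos(yaw) - lateral * math.sin(yaw)
    new_y = y + longitudinal * math.sin(yaw) + lateral * math.cos(yaw)
    return (new_x, new_y, yaw + dyaw)

# Hypothetical example: move 2 m forward and 1 m to the left of the recorded pose.
novel_pose = relocate((0.0, 0.0, 0.0), longitudinal=2.0, lateral=1.0)
```

The relocated pose then determines where the vehicle cameras are rendered from and where the ego vehicle has to appear in the infrastructure image.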

Qualitative comparison of novel-view generation

Qualitative ablation of novel-view generation. For the fixed infrastructure camera, where objects appear small, our masking-and-inpainting approach yields cleaner results than 3DGS.

Human annotators rank candidate trajectories by driving quality. We use these rankings to validate that the EPDMS-based metric (Extended Predictive Driver Model Score) aligns with human judgment.
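One simple way to quantify agreement between human rankings and a metric is the fraction of trajectory pairs that both order the same way. The sketch below illustrates that idea only. The page does not state which agreement statistic is used, and the function name and input conventions (rank 1 is best, higher score is better) are assumptions.

```python
from itertools import combinations

def pairwise_agreement(human_ranks, metric_scores):
    """Fraction of trajectory pairs ordered identically by humans and by the metric.

    human_ranks: 1 is best. metric_scores: higher is better.
    Pairs tied in either ordering are skipped.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_ranks)), 2):
        human_pref = human_ranks[j] - human_ranks[i]        # > 0: humans prefer i
        metric_pref = metric_scores[i] - metric_scores[j]   # > 0: metric prefers i
        if human_pref == 0 or metric_pref == 0:
            continue
        total += 1
        if (human_pref > 0) == (metric_pref > 0):
            agree += 1
    return agree / total if total else float("nan")

# Hypothetical rankings for three candidate trajectories.
agreement = pairwise_agreement([1, 2, 3], [0.9, 0.2, 0.7])
```

A value near 1 means the metric reproduces the human ordering, while a value near 0.5 means it does no better than chance on these pairs.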

BibTeX

The citation will be available once the paper is published.

Our Related Work

VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting (NeurIPS 2025). Hoonhee Cho^*, Jae-Young Kang^*, Giwon Lee^*, Hyemin Yang^*, Heejun Park, Seokwoo Jung, Kuk-Jin Yoon. A viewpoint-robust end-to-end driving framework that leverages feed-forward 3D Gaussian Splatting to synthesize novel views for robust planning — the basis on which VIPS builds its V2I pseudo-simulation. [Project] [arXiv] [Code]

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models (arXiv 2026). Hoonhee Cho^*, Giwon Lee^*, Jae-Young Kang^*, Hyemin Yang^*, Heejun Park^*, Kuk-Jin Yoon^†. A trajectory-driven learning paradigm with a world model that captures domain-invariant driving intent, enabling a single unified end-to-end model to perform robustly across heterogeneous domains (nuScenes, NAVSIM, Waymo). [arXiv]

^*Equal Contribution ^†Corresponding Author