PREX: Faithful 4D Video Editing with Region-Aware Conditioning
Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a Region-Aware Adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video quality scores with targeted diagnostics for preservation drift, ghost leakage, boundary copying, and temporal instability.
Preserve: pixels backed by valid source observations. PREX retrieves appearance from nearby source frames, subject to visibility and depth-consistency checks, so that observed content is preserved faithfully.
Reveal: unsupported but within-scene regions (disocclusions). PREX exposes these regions to the diffusion model for plausible in-scene completion using spatiotemporal context.
Divide target-frame pixels into Preserve, Reveal, and Expand regions based on observation support. Compute geometric confidence maps from projection coverage, instance consistency, and depth variation.
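The partition step can be illustrated with a minimal NumPy sketch. It assumes two per-pixel boolean inputs, `supported` (a valid source observation projects to the pixel) and `in_scene` (the pixel lies within the observed scene extent), and it combines the three confidence terms by simple multiplication. The function names, the `depth_scale` parameter, and the product rule are illustrative assumptions, not the method's actual formulation.

```python
import numpy as np

# Hypothetical integer codes for the three region roles.
PRESERVE, REVEAL, EXPAND = 0, 1, 2


def assign_roles(supported, in_scene):
    """Label each target pixel as Preserve, Reveal, or Expand.

    supported: bool array, True where a valid source observation backs the pixel.
    in_scene:  bool array, True where the pixel lies within the observed scene extent.
    """
    supported = np.asarray(supported, dtype=bool)
    in_scene = np.asarray(in_scene, dtype=bool)
    # Start from Expand (no support, outside the scene extent) ...
    roles = np.full(supported.shape, EXPAND, dtype=np.uint8)
    # ... mark unsupported in-scene pixels (disocclusions) as Reveal ...
    roles[in_scene & ~supported] = REVEAL
    # ... and let observation support take priority over everything else.
    roles[supported] = PRESERVE
    return roles


def geometric_confidence(coverage, instance_agreement, depth_variation, depth_scale=0.1):
    """Combine three per-pixel terms into one confidence value in [0, 1].

    Assumption: coverage and instance_agreement are already in [0, 1], and
    high local depth variation should lower confidence exponentially.
    """
    coverage = np.clip(np.asarray(coverage, dtype=float), 0.0, 1.0)
    instance_agreement = np.clip(np.asarray(instance_agreement, dtype=float), 0.0, 1.0)
    depth_term = np.exp(-np.asarray(depth_variation, dtype=float) / depth_scale)
    return coverage * instance_agreement * depth_term
```

In this sketch a pixel that is supported always becomes Preserve, even if the scene-extent mask disagrees; a real pipeline may resolve that conflict differently.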
Construct appearance cues from valid source observations using visibility, depth, instance, and view-time checks. Unsupported pixels receive only weak or low-confidence conditioning.
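As a rough illustration of cue construction, the sketch below applies only two of the listed checks (visibility and depth consistency) to source colors that have already been reprojected into the target view. The instance and view-time checks are omitted for brevity. All names, the relative depth tolerance `depth_tol`, and the `weak_conf` fallback value are assumptions made for this example.

```python
import numpy as np


def build_appearance_cue(warped_rgb, warped_depth, target_depth, visible,
                         confidence, depth_tol=0.05, weak_conf=0.0):
    """Keep reprojected source colors only where they pass validity checks.

    warped_rgb:   (H, W, 3) source colors reprojected into the target view.
    warped_depth: (H, W) depth of the reprojected source points.
    target_depth: (H, W) depth rendered from the edited 4D proxy.
    visible:      (H, W) bool, True where the source point is unoccluded.
    confidence:   (H, W) geometric confidence in [0, 1].
    """
    warped_rgb = np.asarray(warped_rgb, dtype=float)
    warped_depth = np.asarray(warped_depth, dtype=float)
    target_depth = np.asarray(target_depth, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    confidence = np.asarray(confidence, dtype=float)

    # Relative depth test: the source point must land close to the target surface.
    depth_ok = np.abs(warped_depth - target_depth) <= depth_tol * np.maximum(target_depth, 1e-6)
    valid = visible & depth_ok

    # Unsupported pixels get a blank cue and only weak confidence.
    cue = np.where(valid[..., None], warped_rgb, 0.0)
    cue_conf = np.where(valid, confidence, weak_conf)
    return cue, cue_conf, valid
```

Zeroing the cue together with its confidence means downstream conditioning can tell "no evidence" apart from "dark pixel", provided it reads both channels.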
A lightweight adapter maps appearance cues, confidence maps, and region masks into residual control tokens that are injected into a frozen video diffusion backbone. The adapter is trained with proxy tasks, so no paired editing data is required.
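The idea of residual control tokens can be sketched for a single frame in plain NumPy. This is not the Region-Aware Adapter itself: the patch size, the two-layer projection, the confidence-weighted input, and the zero-initialized output layer (a common trick for adapters on frozen backbones) are all assumptions chosen to keep the example small.

```python
import numpy as np


class RegionAwareAdapterSketch:
    """Toy single-frame adapter: conditioning maps -> residual tokens."""

    def __init__(self, in_channels, token_dim, patch=2, seed=0):
        rng = np.random.default_rng(seed)
        self.patch = patch
        self.w_in = rng.normal(0.0, 0.02, size=(in_channels * patch * patch, token_dim))
        # Assumption: zero-initialized output projection, so the residual is
        # exactly zero before training and the frozen backbone is unchanged.
        self.w_out = np.zeros((token_dim, token_dim))

    def control_tokens(self, appearance, confidence, role_masks):
        """appearance (H, W, 3), confidence (H, W), role_masks (H, W, 3) one-hot."""
        conf = confidence[..., None]
        # Weight appearance by confidence so weak evidence contributes weakly.
        x = np.concatenate([appearance * conf, conf, role_masks], axis=-1)
        h, w, c = x.shape
        p = self.patch
        # Split the frame into non-overlapping p x p patches, one token per patch.
        patches = (
            x.reshape(h // p, p, w // p, p, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(-1, p * p * c)
        )
        hidden = np.tanh(patches @ self.w_in)
        return hidden @ self.w_out

    def inject(self, backbone_tokens, appearance, confidence, role_masks):
        """Add the control tokens to the backbone tokens as a residual."""
        return backbone_tokens + self.control_tokens(appearance, confidence, role_masks)
```

With 3 appearance channels, 1 confidence channel, and 3 role channels, `in_channels` is 7. The frame height and width must be divisible by `patch`, and `backbone_tokens` must have one row per patch.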
PREBench is the first region-aware diagnostic benchmark for 4D video editing. For each editing case it provides the source video, the edited 4D proxy, the target cameras, and region masks (Preserve / Reveal / Expand), which enables targeted evaluation of preservation fidelity, ghost leakage, boundary artifacts, and temporal stability. It covers 350 real-world editing cases spanning camera-only edits and joint camera-and-object motion edits.
| Metric Category | Metrics | What It Evaluates |
|---|---|---|
| Preserve | P-LPIPS, P-DISTS, P-TempDrift, P-Dyn-LPIPS | Preservation fidelity, appearance drift, temporal stability of observed content |
| Reveal | R-Ghost, R-Seam | Ghost leakage from source, seam visibility at reveal boundaries |
| Expand | E-Temp, E-Seam, E-Copy | Extrapolation coherence, boundary copying, degenerate texture repetition |
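To show how region masks turn a global score into a targeted diagnostic, the sketch below restricts two simple errors to the Preserve region. It uses mean absolute error as a stand-in for a perceptual distance such as LPIPS or DISTS, so it does not reproduce P-LPIPS or P-TempDrift as defined by the benchmark. The function names and array layouts are assumptions.

```python
import numpy as np


def masked_mean(error_map, mask):
    """Average an error map over the pixels selected by a boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return float("nan")  # the region is empty for this case
    return float(np.asarray(error_map, dtype=float)[mask].mean())


def preserve_error(output, reference, preserve_mask):
    """Mean absolute error inside the Preserve region.

    output, reference: (T, H, W, 3) videos; preserve_mask: (T, H, W) bool.
    """
    err = np.abs(np.asarray(output, dtype=float) - np.asarray(reference, dtype=float)).mean(axis=-1)
    return masked_mean(err, preserve_mask)


def preserve_temporal_drift(output, reference, preserve_mask):
    """Mean frame-to-frame change of the residual inside the Preserve region."""
    residual = np.asarray(output, dtype=float) - np.asarray(reference, dtype=float)
    delta = np.abs(np.diff(residual, axis=0)).mean(axis=-1)  # (T-1, H, W)
    mask = np.asarray(preserve_mask, dtype=bool)
    # Only compare pixels that are Preserve in both neighboring frames.
    both = mask[1:] & mask[:-1]
    return masked_mean(delta, both)
```

The same `masked_mean` helper could be reused with Reveal or Expand masks; only the per-pixel error map would change.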
We built an interactive scene editor that enables users to intuitively explore and manipulate 4D scene representations. The editor provides real-time camera control, object selection, and region-aware editing capabilities — allowing researchers and artists to interact with the Preserve / Reveal / Expand framework in a visual, hands-on manner.
@misc{hu2026preserverevealexpandfaithful,
title={Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning},
author={Zhangchi Hu and Wenzhang Sun and Xiangchen Yin and Jiahui Yuan and Chunfeng Wang and Hao Li and Kun Zhan and Xiaoyan Sun},
year={2026},
eprint={2605.20961},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.20961},
}