Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data covering only simple editing operations, which limits their ability to generalize to diverse, complex real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline that synthetically generates diverse, high-fidelity paired video–instruction data for basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
A context-aware VLM instructor encodes the system prompt, instruction, first frame of the source video, and an optional reference image into VLM tokens. A trainable token refiner aligns these tokens to the pretrained DiT latent space. The VAE encodings of the source video are added to form context-aware noise tokens for denoising.
We inject stochasticity via Flow-SDE to generate diverse samples, score them with our reward system, and compute a GRPO loss from the resulting relative advantages to update the model. For efficiency, we train a LoRA adapter rather than fine-tuning the full model.
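To make the "relative advantages" step concrete, the following is a minimal sketch of how group-relative advantages are commonly computed in GRPO: each sample's reward is normalized by the mean and standard deviation of its own group. The function names, the equal reward weights, and the example scores are illustrative assumptions and are not taken from VIVA; the three reward terms only mirror the criteria named above (instruction faithfulness, content preservation, aesthetics).

```python
from statistics import mean, pstdev


def total_reward(instruction_score, preservation_score, aesthetic_score,
                 weights=(1.0, 1.0, 1.0)):
    """Combine per-criterion scores into one scalar reward.

    Assumption: a weighted sum with equal weights; the actual reward
    system may combine its terms differently.
    """
    scores = (instruction_score, preservation_score, aesthetic_score)
    return sum(w * s for w, s in zip(weights, scores))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (samples from the same input).

    Each advantage is (reward - group mean) / (group std + eps), so
    samples are only compared against their own group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for three edited samples of the same source video.
group = [
    total_reward(0.9, 0.8, 0.7),
    total_reward(0.4, 0.9, 0.6),
    total_reward(0.6, 0.5, 0.5),
]
advantages = group_relative_advantages(group)
```

Samples with above-average reward receive positive advantages and are reinforced, while below-average samples are penalized. If every sample in a group gets the same reward, all advantages are zero and the group contributes no learning signal.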
Additional qualitative results on complex instructions that are non-trivial and difficult for the data construction pipeline to synthesize.