JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation
Computer Vision · Multimodal Learning
Chengming Xu
Senior Researcher at Youtu Lab, Tencent · Ph.D. in Data Science, Fudan University
I work on deep learning for computer vision with limited supervision, with recent interests in visual in-context learning, multimodal reasoning, and controllable generation.
Research interests: few-shot learning, visual in-context learning, vision-language models, and video generation/editing.
Selected Recent Works
* Equal contribution / core contributor. † Corresponding author.
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing
Dual Latent Memory for Visual Multi-agent System
FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
Towards Reliable and Holistic Visual In-Context Learning Prompt Selection