Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance

Under Submission
Code (coming soon) · arXiv (coming soon)

Abstract

Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often struggle with temporal consistency, particularly in the talking-head domain, where continuous changes in facial expression make the task considerably harder. These issues stem from editing each frame independently, which discards temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that intuitively and inherently learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This ensures that temporally inconsistent edited avatars inherit the motion of the rendered avatars. Second, to maintain fine-grained temporal consistency of expressions in talking-head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on the landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in temporal consistency and can be used to optimize and compensate for temporally inconsistent outputs in a range of applications, such as text-driven editing and relighting.
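For intuition only, the trajectory guidance above can be pictured as extracting, for every frame, a dense displacement field from the first frame and resampling it to several spatial scales before it conditions the diffusion model. The PyTorch sketch below is an illustrative simplification, not our actual implementation (to be released with the code): `estimate_flow` is a hypothetical placeholder for any dense-flow or point-tracking backbone, and the scale set is arbitrary.

```python
import torch
import torch.nn.functional as F


def estimate_flow(frame0: torch.Tensor, frame_t: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for a dense motion estimator (e.g. an
    off-the-shelf optical-flow or point-tracking network). Returns a
    (B, 2, H, W) field of per-pixel displacements from frame0 to frame_t."""
    raise NotImplementedError("plug in a flow / tracking backbone here")


def trajectory_pyramid(frame0, frame_t, scales=(1.0, 0.5, 0.25)):
    """Motion trajectories from the first frame to frame t at several
    spatial scales, used as conditioning for the editing diffusion model."""
    flow = estimate_flow(frame0, frame_t)                      # (B, 2, H, W)
    pyramid = []
    for s in scales:
        f = F.interpolate(flow, scale_factor=s, mode="bilinear",
                          align_corners=False)
        pyramid.append(f * s)  # rescale displacements to the new resolution
    return pyramid
```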

Methodology

Overview of the FYM pipeline (figure).

FYM begins with an efficient 3DGS model that renders temporally consistent portraits. Next, we develop a diffusion model that learns motion trajectory changes at different scales and pixel coordinates from the original video frames, so that the edited frames inherit the motion of the rendered ones. Finally, we propose a dynamic re-weighted attention mechanism that assigns higher weight coefficients to landmark coordinates in space and dynamically updates these weights based on the landmark loss, achieving more consistent and refined facial expressions.
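The following is a minimal sketch of the dynamic re-weighted attention idea, for intuition only and not our exact implementation: the tensor shapes, the Gaussian landmark weight map, and the boost-update rule below are simplifications chosen for illustration.

```python
import torch
import torch.nn.functional as F


def landmark_weight_map(landmarks, h, w, sigma=2.0):
    """Gaussian bumps centred on the (x, y) landmark coordinates, shifted to
    [1, 2] so that non-landmark pixels keep a neutral weight of 1."""
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    heat = torch.zeros(h, w)
    for x, y in landmarks:  # landmarks: iterable of (x, y) pixel coordinates
        bump = torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heat = torch.maximum(heat, bump)
    return 1.0 + heat  # (h, w)


def reweighted_attention(q, k, v, weight_map, boost):
    """Scaled dot-product attention whose logits are boosted for keys that
    lie on landmark positions; `boost` is the dynamically updated coefficient."""
    # q, k, v: (B, heads, h*w, d); weight_map: (h, w)
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, heads, N, N)
    w = weight_map.flatten().view(1, 1, 1, -1)              # one weight per key
    logits = logits + boost * torch.log(w)
    return F.softmax(logits, dim=-1) @ v


def update_boost(boost, landmark_loss, lr=0.1):
    """Assumed update rule: increase the boost while the landmark loss between
    the edited frame and the rendered frame remains large."""
    return boost + lr * float(landmark_loss)
```

In this reading, regions around facial landmarks receive larger attention weights, and the boost coefficient grows when the landmark loss indicates that the edited expression still drifts from the rendered one.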

Comparisons with Video Editing Methods

The Editing Results of Our Method: InstructPix2Pix+FYM

Prompt: Change him into Lego style · Prompt: Give him a mustache

(original / edited results)

Prompt: Change her into a character in Super Mario · Prompt: Change her into oil painting style

(original / edited results)

The Editing Results of Our Method: Neural Style Transfer+FYM

(style reference / original / edited results)

The Editing Results of Our Method: ICLight+FYM

Prompt: Detailed face, sunshine, outdoor, warm atmosphere, right

Prompt: Detailed face, neon light, city, right

Prompt: Detailed face, sunset over sea, left