Car Motion Control
Motion decomposition can enable multiple downstream tasks, such as manipulating the strength of a single motion component. Here we modulate the strength of the car motion by scaling the transition latent trajectory.
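To make this concrete, the sketch below shows one way such strength control could be implemented, assuming the video has already been embedded as a sequence of W+ latents. The names `latents`, `generator`, and `scale_motion` are illustrative placeholders, not this project's actual API; the same idea applies to any single motion component (car motion here, facial expressions below).

```python
import torch

def scale_motion(latents: torch.Tensor, alpha: float) -> torch.Tensor:
    """Scale the transition trajectory relative to the first frame.

    latents: per-frame W+ codes of shape (T, num_layers, 512).
    alpha:   motion strength; alpha < 1 dampens, alpha > 1 amplifies.
    """
    anchor = latents[0:1]            # reference-frame latent
    deltas = latents - anchor        # per-frame transition trajectory
    return anchor + alpha * deltas   # rescaled trajectory, same shape

# Example: amplify the motion component 1.5x and re-synthesize each frame.
# scaled = scale_motion(latents, alpha=1.5)
# frames = [generator(w.unsqueeze(0)) for w in scaled]
```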
Real-world objects perform complex motions that involve multiple decomposable motion components. For example, while talking, a person continuously changes their expressions, head pose, and body pose.
Recently, a range of methods have been proposed for unconditional video generation. However, these approaches provide limited control for manipulating an object's motion in the generated video. In this work, we propose object motion decomposition in videos using a pretrained image GAN model. We generate disentangled linear motion subspaces in the latent space of widely used StyleGAN2 models, each of which is semantically meaningful and controls a single explainable motion component. We evaluate the disentanglement properties of the motion subspaces on face and car datasets with extensive quantitative and qualitative experiments.
Furthermore, we apply the proposed motion decomposition to downstream video editing tasks such as selective motion transfer (e.g., transferring only facial expressions) and video stabilization, without explicitly training for these tasks. As we utilize pretrained StyleGAN2 models, our method integrates seamlessly with latent-based editing and stylization methods.
Here we modulate the strength of the subject's facial expressions. Observe the larger facial motions at larger strength values.
As we use StyleGAN2 as the backbone, our method seamlessly integrates with attribute editing based on latent space manipulation. We first embed the source image and the driving video into the W+ latent space and then transfer the motion trajectory to the source latent. Finally, we perform attribute editing on the transferred latent trajectory to obtain reenactment results with editing.
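A minimal sketch of this pipeline is given below. Here `encode` stands in for a GAN-inversion encoder (e.g., e4e), `edit_direction` for any latent attribute-editing direction (e.g., from InterFaceGAN), and `generator` for the pretrained StyleGAN2 synthesis network; these names are assumptions for illustration, not the project's exact interfaces.

```python
import torch

def reenact_and_edit(source_img, driving_latents, encode, generator,
                     edit_direction, edit_strength=2.0):
    # 1. Embed the source image into the W+ latent space.
    w_source = encode(source_img)                  # (num_layers, 512)

    # 2. Transfer the driving trajectory onto the source latent by adding
    #    per-frame offsets relative to the first driving frame.
    deltas = driving_latents - driving_latents[0:1]
    w_traj = w_source.unsqueeze(0) + deltas        # (T, num_layers, 512)

    # 3. Apply the attribute edit to every frame of the trajectory.
    w_edited = w_traj + edit_strength * edit_direction

    # 4. Synthesize the edited reenactment video frame by frame.
    return [generator(w.unsqueeze(0)) for w in w_edited]
```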
Using StyleGAN2 also enables video stylization by adapting the generator to a new domain with StyleGAN-NADA.
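Since the motion is carried entirely by the latent trajectory, stylization amounts to rendering the same trajectory with a domain-adapted generator. Below is a hedged sketch; `adapted_generator` is assumed to be the StyleGAN2 generator after StyleGAN-NADA text-guided fine-tuning, obtained outside this snippet.

```python
import torch

def stylize_video(latent_trajectory: torch.Tensor, adapted_generator):
    """Render an existing W+ trajectory with a domain-adapted generator.

    The trajectory (and hence the motion) is left untouched; only the
    synthesis network changes, so the video is restyled frame by frame.
    """
    with torch.no_grad():
        return [adapted_generator(w.unsqueeze(0)) for w in latent_trajectory]
```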
There is a lot of excellent work on which our contribution builds.
Stitch it in Time: GAN-Based Facial Editing of Real Videos
StyleGAN2: The style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling.
StyleGAN-NADA converts a pre-trained generator to new domains using only a textual prompt and no training data.
@inproceedings{parihar2023style,
  author    = {Rishubh Parihar and Raghav Magazine and Piyush Tiwari and R. Venkatesh Babu},
  title     = {We never go out of Style: Motion Disentanglement by Subspace Decomposition of Latent Space},
  booktitle = {AI4CC-CVPRW},
  year      = {2023},
}