PreciseControl: Enhancing Text-to-Image Diffusion Models with Fine-Grained Attribute Control

ECCV 2024

¹Vision and AI Lab, Indian Institute of Science    ²Indian Institute of Technology, Kharagpur

We enable precise attribute control in Text-to-Image Generation by combining Diffusion Models and StyleGANs

Abstract

Recently, we have seen a surge of personalization methods for text-to-image (T2I) diffusion models to learn a concept using a few images. Existing approaches, when used for face personalization, struggle to achieve convincing inversion with identity preservation and rely on semantic text-based editing of the generated face. However, more fine-grained control is desired for facial attribute editing, which is challenging to achieve solely with text prompts. In contrast, StyleGAN models learn a rich face prior and enable smooth control over fine-grained attribute editing by latent manipulation. This work uses the disentangled W+ space of StyleGANs to condition the T2I model. This approach allows us to precisely manipulate facial attributes, such as smoothly introducing a smile, while preserving the existing coarse text-based control inherent in T2I models. To enable conditioning of the T2I model on the W+ space, we train a latent mapper to translate latent codes from W+ to the token embedding space of the T2I model. The proposed approach excels in the precise inversion of face images with attribute preservation and facilitates continuous control for fine-grained attribute editing. Furthermore, our approach can be readily extended to generate compositions involving multiple individuals. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.

Personalization


Given a single portrait image, we extract its w latent representation using the StyleGAN encoder E_GAN. The latent w and the diffusion timestep t are passed through the latent adaptor M to generate a pair of time-dependent token embeddings (v_t^1, v_t^2) representing the input subject. Finally, the token embeddings are combined with arbitrary prompts to generate customized images.
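Below is a minimal sketch of this mapping, assuming a 512-dimensional w code and 768-dimensional token embeddings (as in Stable Diffusion); the layer sizes and the timestep conditioning are illustrative placeholders, not the exact architecture of M.

# Minimal sketch of the latent adaptor M (illustrative shapes, not the exact architecture).
import torch
import torch.nn as nn

class LatentAdaptor(nn.Module):
    def __init__(self, w_dim: int = 512, token_dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, w_dim))
        self.mlp = nn.Sequential(
            nn.Linear(w_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * token_dim),  # two token embeddings per subject
        )

    def forward(self, w: torch.Tensor, t: torch.Tensor):
        # w: (B, w_dim) latent from the StyleGAN encoder, t: (B,) diffusion timestep
        h = w + self.time_embed(t.float().unsqueeze(-1))  # condition on the timestep
        v1, v2 = self.mlp(h).chunk(2, dim=-1)             # (v_t^1, v_t^2)
        return v1, v2

# Usage: w is a stand-in for E_GAN(image); (v1, v2) replace placeholder tokens
# in a prompt such as "a photo of <v1> <v2> on a beach" before denoising.
w = torch.randn(1, 512)
t = torch.tensor([500])
v1, v2 = LatentAdaptor()(w, t)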





Attribute Editing/Control



We map the given input image to its w latent code, which is shifted along a global linear attribute edit direction to obtain the edited latent code w*. The edited latent code w* is then passed through the T2I model to obtain fine-grained attribute edits. The scalar edit strength parameter β can be varied to obtain continuous attribute control. Our method performs disentangled edits for various attributes while preserving identity and generalizing to in-the-wild faces, styles, and multiple persons.

Identity Interpolation. We can perform smooth interpolation between identities by interpolating between the corresponding w codes.
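The latent arithmetic itself is simple; the sketch below illustrates it with placeholder tensors, where `direction` stands for any global linear attribute direction (e.g. from InterFaceGAN) and `beta`/`alpha` are the edit-strength and interpolation parameters.

import torch

def edit_latent(w: torch.Tensor, direction: torch.Tensor, beta: float) -> torch.Tensor:
    """Shift w along a global linear attribute direction; beta controls edit strength."""
    return w + beta * direction

def interpolate_identities(w_a: torch.Tensor, w_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linearly interpolate between two identities in w space."""
    return (1.0 - alpha) * w_a + alpha * w_b

# Sweeping beta gives continuous control over a single attribute, e.g. a smile:
w = torch.randn(1, 512)          # w code of the input face
smile_dir = torch.randn(1, 512)  # placeholder for a learned edit direction
edited = [edit_latent(w, smile_dir, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
# Each edited code is then passed through the latent adaptor and the T2I model.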






Additional Edit Directions

We perform attribute editing using edit directions obtained from InterFaceGAN. These edit directions generalize well in T2I models, indicating that any off-the-shelf StyleGAN editing method can be easily integrated into our framework.


Composing Multiple Subjects



We run multiple parallel diffusion processes, one for each subject and one for the background, and fuse them using an instance segmentation mask at each denoising step. The instance mask can be user-provided or obtained by segmenting an image of two subjects generated by the same T2I model. Note that the diffusion process for each subject is passed through its corresponding fine-tuned model, which results in excellent identity preservation.
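A minimal sketch of the per-step fusion is shown below; the commented loop uses a hypothetical `denoise_step` standing in for one reverse-diffusion step of the corresponding fine-tuned model, and the masks are assumed to be resized to the latent resolution.

import torch

def fuse_latents(latents: list, masks: list) -> torch.Tensor:
    """Compose per-subject and background latents with instance masks at one denoising step.

    latents[i]: (B, C, H, W) noisy latent from the i-th diffusion process
    masks[i]:   (B, 1, H, W) binary instance mask for the i-th region; masks cover the image
    """
    fused = torch.zeros_like(latents[0])
    for z, m in zip(latents, masks):
        fused = fused + m * z
    return fused

# Composed generation, one fusion per denoising step (pseudocode):
# for t in timesteps:
#     z_subjects = [denoise_step(model_i, z, t, prompt_i) for (model_i, prompt_i) in subjects]
#     z_bg = denoise_step(base_model, z, t, background_prompt)
#     z = fuse_latents(z_subjects + [z_bg], subject_masks + [background_mask])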





Multi-person Attribute Control


We can perform continuous edits on individual characters in the scene with minimal entanglement. Observe the smooth edit transformations for different attributes while preserving identity.



Applications


Face Restoration


The disentangled W+ latent space serves as an excellent face prior in StyleGANs. Leveraging this, we perform face restoration by projecting a corrupted face image into the W+ latent space using the StyleGAN encoder and conditioning the T2I model on the resulting code. The generated images look realistic and can be further edited using the T2I model's capabilities. Directly personalizing on such corrupted images is challenging, as the model can easily overfit the single degraded face image.
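As a rough sketch of this call flow, with `stylegan_encoder`, `latent_adaptor`, and `t2i_generate` as placeholders for the components described above (not actual API names):

# Hypothetical restoration flow: project the corrupted face to W+ and condition the T2I model.
def restore_face(corrupted_image, prompt, stylegan_encoder, latent_adaptor, t2i_generate):
    w = stylegan_encoder(corrupted_image)            # encoder is robust to blur / occlusion
    return t2i_generate(prompt, latent_adaptor, w)   # clean output, still editable via prompts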



Sketch-to-image

We use a pSp StyleGAN encoder trained to map edge images to the W+ latent space for sketch-to-image generation. The obtained latent code can then be used to condition the T2I model for realistic image generation.



Acknowledgments

The authors are thankful to Aniket Dashpute, Ankit Dhiman and Abhijnya Bhat for reviewing the draft and providing helpful feedback. Rishubh Parihar is partly supported by PMRF from Govt. of India and Kotak AI Center at IISc.


BibTeX

@inproceedings{rishubh2024precisecontrol,
      title={PreciseControl: Enhancing Text-to-Image Diffusion Models with Fine-Grained Attribute Control},
      author={Rishubh Parihar and Sachidanand VS and Sabariswaran Mani and Tejan Karmali and R. Venkatesh Babu},
      booktitle={European Conference on Computer Vision},
      year={2024},
}