Compass Control enables precise, continuous object orientation control in text-to-image diffusion models. Given an input orientation angle along with a text prompt, our method generates realistic scenes that follow the prompt with object-centric orientation control.
We address the problem of multi-object orientation control in text-to-image diffusion models, enabling precise orientation control for diverse multi-object scenes. The key idea is to condition the diffusion model on a set of orientation-aware compass tokens, one per object, alongside the text tokens. These compass tokens are predicted by a lightweight encoder. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets. However, directly training this framework yields poor orientation control and leads to entanglement among objects. To mitigate this, we constrain the cross-attention map of each compass token to its corresponding object region. The trained model achieves precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization. Further, when combined with personalization methods, our method precisely controls the orientation of the new object in diverse contexts.
We encode object orientation as an additional attribute in the text embedding space of the T2I model. Specifically, we introduce a special token, dubbed the compass token c, alongside each object token in the prompt (e.g., "A photo of c1 SUV and c2 sedan on a road."). Each compass token is predicted by a lightweight encoder that takes the prescribed orientation angle as input. This formulation preserves the original interface of the base T2I model and enables precise object-centric orientation control in multi-object scenes. We train the encoder and fine-tune the denoising U-Net with LoRA on a synthetic dataset of scenes containing one or two 3D assets placed in diverse layouts on an empty floor.
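A minimal sketch of what such an encoder could look like, assuming a small MLP that maps an orientation angle (as sin/cos features) to a token embedding in the text embedding space; the class name, dimensions, and architecture below are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CompassEncoder(nn.Module):
    """Hypothetical lightweight encoder: maps an orientation angle to a
    compass-token embedding. embed_dim matches the T2I text encoder
    (e.g., 768 for CLIP ViT-L); hidden_dim is an arbitrary choice here."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),   # (sin, cos) of the angle
            nn.SiLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, theta: torch.Tensor) -> torch.Tensor:
        # theta: (B,) orientation angles in radians, one per object
        feats = torch.stack([torch.sin(theta), torch.cos(theta)], dim=-1)
        return self.mlp(feats)          # (B, embed_dim) compass tokens

# Usage: predict compass tokens c1, c2 and splice them into the text
# embedding sequence at the placeholder positions before each object noun.
encoder = CompassEncoder()
c = encoder(torch.tensor([0.0, 1.57]))  # e.g., SUV at 0 rad, sedan at ~90 deg
```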
Directly training the compass token conditioning invariably results in attribute mixing between objects: the input orientation of one object is wrongly transferred to the other. To address this, we constrain the cross-attention maps of both the compass token and the corresponding object token to lie inside a given 2D bounding box. This enforces a tight association between each object and its compass token, and additionally provides control over object location during generation. Notably, we require only coarse bounding boxes as input, which can also be randomly spawned at inference without user intervention.
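One plausible way to realize this constraint, sketched below under the assumption of hard masking: for each constrained text token (a compass token and its object token), attention from spatial queries outside the token's bounding box is zeroed and the per-query attention is renormalized. The function name, tensor layout, and masking scheme are assumptions for illustration, not the exact implementation:

```python
import torch

def mask_cross_attention(attn: torch.Tensor, token_boxes: dict,
                         h: int, w: int) -> torch.Tensor:
    """Constrain cross-attention so selected text tokens attend only
    inside their 2D bounding boxes.

    attn: (B, heads, h*w, n_text_tokens) attention probabilities
          (queries are spatial locations, keys are text tokens).
    token_boxes: {token_index: (x0, y0, x1, y1)}, coordinates in [0, 1];
                 used for both a compass token and its object token.
    """
    ys = torch.linspace(0, 1, h, device=attn.device).view(h, 1).expand(h, w)
    xs = torch.linspace(0, 1, w, device=attn.device).view(1, w).expand(h, w)
    attn = attn.clone()
    for t, (x0, y0, x1, y1) in token_boxes.items():
        inside = ((xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)).flatten()
        attn[..., t] = attn[..., t] * inside.to(attn.dtype)  # zero outside the box
    # Renormalize so each spatial query's attention over tokens sums to 1.
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

# Example: tokens 4 (compass token c1) and 5 ('SUV') are both constrained
# to the left half of a 16x16 attention map with 77 text tokens.
attn = torch.softmax(torch.randn(1, 8, 16 * 16, 77), dim=-1)
boxes = {4: (0.0, 0.2, 0.5, 0.9), 5: (0.0, 0.2, 0.5, 0.9)}
attn = mask_cross_attention(attn, boxes, h=16, w=16)
```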
Given a few (~10) input images of an object, we can personalize the text-to-image model and generate the given subject with precise pose control in diverse contexts.
@inproceedings{rishubh2025compasscontrol,
  title={Compass Control: Multi-Object Orientation Control for Text-to-Image Generation},
  author={Rishubh Parihar and Vaibhav Agarwal and Sachidanand VS and R. Venkatesh Babu},
  booktitle={Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
This website is built on the Clarity Template, designed by Shikun Liu.