Current monocular 3D detectors are held back by the limited diversity and scale of real-world datasets. While data augmentation certainly helps, it's particularly difficult to generate realistic scene-aware augmented data for outdoor settings. Most current approaches to synthetic data generation focus on realistic object appearance through improved rendering techniques. However, we show that where and how objects are positioned is just as crucial for training effective monocular 3D detectors. The key obstacle lies in automatically determining realistic object placement parameters, including position, dimensions, and directional alignment, when introducing synthetic objects into actual scenes. To address this, we introduce MonoPlace3D, a novel system that considers the 3D scene content to create realistic augmentations. Specifically, given a background scene, MonoPlace3D learns a distribution over plausible 3D bounding boxes. Subsequently, we render realistic objects and place them according to the locations sampled from the learned distribution. Our comprehensive evaluation on two standard datasets, KITTI and NuScenes, demonstrates that MonoPlace3D significantly improves the accuracy of multiple existing monocular 3D detectors while being highly data efficient.
We propose SA-PlaceNet, a placement network that predicts plausible 3D object locations from a single scene image. It learns the distribution of 3D object bounding boxes conditioned on the input image. To train the model, we synthetically generate a paired dataset of scene images and corresponding 3D box placements by inpainting objects into the scenes. However, because real scenes contain only a few objects, the training signal is weak, and directly training SA-PlaceNet on this limited dataset leads to overfitting to those few object placements. To overcome this, we introduce a geometry-aware augmentation module that refines box locations by interpolating with neighboring object positions. Additionally, instead of predicting a fixed set of 3D boxes for a given input image, SA-PlaceNet outputs a distribution over possible 3D boxes. Together, these modifications enable the model to generate a continuous, diverse set of plausible object placements.
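As a rough illustration of this design, the PyTorch sketch below shows one way such a placement network could be structured; the class names, the seven-parameter box encoding (x, y, z, h, w, l, yaw), the ResNet-18 backbone, and the interpolation details are our own assumptions, not specifics from the paper. The model predicts per-box means and log-variances so that repeated sampling yields diverse placements, and the toy augmentation function interpolates each box center toward a neighboring box.

import torch
import torch.nn as nn
import torchvision.models as models

class SAPlaceNetSketch(nn.Module):
    """Minimal sketch of a placement network: given a scene image, predict a
    distribution (mean, log-variance) over K candidate 3D boxes, each encoded
    here as (x, y, z, h, w, l, yaw). Architecture details are assumptions."""
    def __init__(self, num_boxes=4, box_dim=7):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.mu_head = nn.Linear(512, num_boxes * box_dim)
        self.logvar_head = nn.Linear(512, num_boxes * box_dim)
        self.num_boxes, self.box_dim = num_boxes, box_dim

    def forward(self, image):
        feat = self.encoder(image).flatten(1)                      # (B, 512)
        mu = self.mu_head(feat).view(-1, self.num_boxes, self.box_dim)
        logvar = self.logvar_head(feat).view(-1, self.num_boxes, self.box_dim)
        return mu, logvar

    def sample(self, image):
        # Reparameterized sampling: each call returns a different plausible
        # set of box placements for the same scene.
        mu, logvar = self.forward(image)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def geometry_aware_augment(boxes, alpha=0.5):
    """Toy version of geometry-aware augmentation: jitter each ground-truth
    box by interpolating its center toward a randomly chosen neighbor."""
    boxes = boxes.clone()
    n = boxes.shape[0]
    if n < 2:
        return boxes
    neighbor = boxes[torch.randperm(n)]
    t = torch.rand(n, 1) * alpha
    boxes[:, :3] = (1 - t) * boxes[:, :3] + t * neighbor[:, :3]   # interpolate (x, y, z)
    return boxes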
We generate realistic scenes by rendering cars within the predicted 3D bounding boxes. Our pipeline leverages the synthetic ShapeNet dataset to create high-quality car renderings. Specifically, we sample a car model from ShapeNet and render it in Blender according to the target bounding box. To enhance realism, we then transform the rendered image using an edge-conditioned ControlNet. This process ensures that the rendered car aligns with the 3D bounding box and blends seamlessly into the background scene.
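A hedged sketch of the refinement step is shown below, using the public diffusers library with a Canny edge ControlNet; the base model, checkpoints, prompt, and strength value are our assumptions rather than the paper's exact settings, and composite.png stands for a hypothetical intermediate image with the Blender-rendered car already pasted into the scene.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# Edge-conditioned ControlNet (Canny) on top of Stable Diffusion.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Hypothetical intermediate: background scene with the rendered car composited
# inside its projected 3D box.
composite = Image.open("composite.png").convert("RGB")

# Canny edges of the composite preserve the car's silhouette and pose, so the
# diffusion model can add realism without moving the object.
edges = cv2.Canny(np.array(composite), 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

refined = pipe(
    prompt="a realistic car on a street, photorealistic",
    image=composite,
    control_image=edge_image,
    strength=0.6,               # keep most of the composite's structure
    num_inference_steps=30,
).images[0]
refined.save("augmented_scene.png")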
We use the proposed scene completion framework to augment datasets for monocular 3D object detection. Given a dataset like KITTI, we train SA-PlaceNet to predict plausible locations for placing new cars. We augment the original scenes by placing cars at predicted 3D locations and add the corresponding 3D boxes to the label files.
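For illustration, here is a small hypothetical helper that appends one such augmented car to a KITTI-format label file; the field order follows the standard KITTI annotation format, while the helper name and the fixed truncation/occlusion values are placeholder assumptions.

import math

def append_kitti_label(label_path, bbox2d, dims, loc, rotation_y):
    """Append one augmented car to a KITTI-format label file.

    KITTI fields per line: type, truncated, occluded, alpha,
    2D bbox (left, top, right, bottom), 3D dims (h, w, l),
    3D location (x, y, z) in camera coordinates, rotation_y.
    """
    left, top, right, bottom = bbox2d
    h, w, l = dims
    x, y, z = loc
    alpha = rotation_y - math.atan2(x, z)   # observation angle from global yaw
    line = (
        f"Car 0.00 0 {alpha:.2f} "
        f"{left:.2f} {top:.2f} {right:.2f} {bottom:.2f} "
        f"{h:.2f} {w:.2f} {l:.2f} {x:.2f} {y:.2f} {z:.2f} {rotation_y:.2f}\n"
    )
    with open(label_path, "a") as f:
        f.write(line)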
We augment widely used autonomous driving datasets with our generated augmentations to train several monocular 3D detectors. The trained detectors consistently outperform the baselines across model architectures.
The authors are thankful to Tejan Karmali, Ankit Dhiman and Abhijnya Bhat for reviewing the draft and providing helpful feedback. Rishubh Parihar is partly supported by PMRF from Govt. of India and Kotak AI Center at IISc.
@inproceedings{rishubh2025monoplace3D,
  title={MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection},
  author={Rishubh Parihar and Srinjay Sarkar and Sarthak Vora and Jogendra Kundu and R. Venkatesh Babu},
  booktitle={Conference on Computer Vision and Pattern Recognition},
  year={2025},
}