AssetDropper
Asset Extraction via Diffusion Models with Reward-Driven Optimization

SIGGRAPH 2025



1The Hong Kong University of Science and Technology (Guangzhou)    2Peking University    3The University of Hong Kong    4The Hong Kong University of Science and Technology   
*Equal Contribution    Project Lead    Corresponding Author   


AssetDropper, a novel model designed to extract assets from user-specified image regions. The extracted assets can be seamlessly applied to various downstream tasks in design and visualization workflows, e.g., 2D mockup creation and 3D model texturing.


Abstract

Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to be significantly enhanced by generative capabilities. Although open-world scenes provide ample raw material for designers, efficiently extracting high-quality, standardized assets remains a challenge. To address this, we introduce AssetDropper, the first generative framework designed to extract any asset from reference images, providing artists with an open-world asset palette. Our model extracts a front view of the selected subject from the input image, handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating future research on downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to close the loop with feedback. The reward model performs an inverse task, pasting the extracted asset back into the reference source, which provides an additional consistency signal during training and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves state-of-the-art results in asset extraction.
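To make the closed feedback loop concrete, below is a minimal sketch of the paste-back consistency idea, assuming hypothetical stand-ins `extract_asset` (the generator) and `paste_back` (the frozen inverse/reward model); it is an illustration of the principle, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def reward_consistency_loss(x_ref, mask, extract_asset, paste_back):
    """Score an extraction by pasting the asset back and comparing to the source."""
    asset = extract_asset(x_ref, mask)                    # predicted standardized asset
    x_rec = paste_back(asset, x_ref * (1 - mask), mask)   # inverse task: reattach asset
    # Agreement inside the masked region rewards faithful, non-hallucinated extractions.
    return F.mse_loss(x_rec * mask, x_ref * mask)
```

In this reading, the reward model stays frozen and only supplies a gradient signal that penalizes extractions which cannot be reattached consistently to the reference image.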

Method Overview


Overview of the AssetDropper framework. For the input reference image xr, we obtain the asset's mask m through Grounding DINO, and from the mask we further obtain the masked image xm and the edge map xe of the asset. The corresponding standardized asset is denoted by xa. Networks with the same color represent the same model (blue for SDXL UNet, red for SDXL-Inpainting UNet). We first use FeatureNet, a UNet that encodes the low-level features of the inputs xr, m, and xe. We then use IP-Adapter to encode the high-level semantics of the masked image xm. ExtractNet is the main UNet that serves as our generator, processing the noisy data xa,t. A detailed text prompt of the asset is provided by GPT-4o to both FeatureNet and ExtractNet, where [V] denotes the subject, e.g., "a surreal fusion of cherries and skulls, blending natural and macabre elements." We train a model with the same architecture as AssetDropperNet to perform the inverse task, reattaching the extracted asset to the reference image while masking the inpainted area. This model is then used as our pre-trained reward model.
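The sketch below illustrates how the conditioning inputs in the figure could be assembled and fed to the two UNets during one denoising step. The module names FeatureNet and ExtractNet follow the caption, but the constructors and helpers (`segment_asset`, `edge_map`, `encode_prompt`, `ip_adapter_embed`) are hypothetical stand-ins for the released code.

```python
import torch

def prepare_and_denoise(x_r, x_a_t, t, segment_asset, edge_map,
                        encode_prompt, ip_adapter_embed,
                        feature_net, extract_net):
    """One denoising forward pass of ExtractNet under FeatureNet conditioning."""
    m   = segment_asset(x_r)          # asset mask m, e.g. from Grounding DINO
    x_m = x_r * m                     # masked reference image xm
    x_e = edge_map(x_r) * m           # edge map xe of the selected asset

    # Detailed text prompt describing the subject [V], e.g. produced by GPT-4o.
    text_emb = encode_prompt("a standardized front view of [V]")

    # FeatureNet (SDXL UNet) encodes low-level features of xr, m, and xe;
    # its intermediate features condition the generator.
    low_level_feats = feature_net(torch.cat([x_r, m, x_e], dim=1), t, text_emb)

    # IP-Adapter supplies high-level semantics of the masked image xm.
    image_emb = ip_adapter_embed(x_m)

    # ExtractNet (the main UNet) denoises the noisy asset xa,t under both
    # low-level and high-level conditioning.
    noise_pred = extract_net(x_a_t, t, text_emb, image_emb, low_level_feats)
    return noise_pred
```

The frozen reward model described above would wrap this generator in a paste-back loss, as in the earlier sketch, so that the extracted asset remains consistent with the reference image.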

Video

Results


Qualitative results on in-the-wild images. For each image block, the first row is the input reference image, and the second row is the output of AssetDropper.

Comparisons


Qualitative comparison on the SAP synthetic test dataset. For the synthetic test dataset, we selected three different meshes with varying levels of surface curvature. Our method achieves asset extraction results that are close to the ground truth.