AssetDropper
Asset Extraction via Diffusion Models with Reward-Driven Optimization

SIGGRAPH 2025



1The Hong Kong University of Science and Technology (Guangzhou)    2Peking University    3The University of Hong Kong    4The Hong Kong University of Science and Technology   
*Equal Contribution    Project Lead    Corresponding Author   


AssetDropper, a novel model designed to extract assets from user-specified image regions. The extracted assets can be seamlessly applied to various downstream tasks in design and visualization workflows, e.g., 2D mockup creation and 3D model texturing.


Abstract

Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to be significantly enhanced by generative capabilities. Although open-world scenes provide ample raw material for designers, efficiently extracting high-quality, standardized assets remains a challenge. To address this, we introduce AssetDropper, the first generative framework designed to extract any asset from reference images, providing artists with an open-world asset palette. Our model extracts a front view of the selected subject from the input image, handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating future research on downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to close the loop with feedback. The reward model performs an inverse task, pasting the extracted asset back into the reference source, which provides an additional consistency signal during training and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves state-of-the-art results in asset extraction.
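To make the closed feedback loop concrete, below is a minimal sketch of the paste-back consistency idea, assuming hypothetical stand-ins `extract_asset` (the generator) and `paste_back` (the frozen inverse/reward model); it is an illustration of the principle, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def reward_consistency_loss(x_ref, mask, extract_asset, paste_back):
    """Score an extraction by pasting the asset back and comparing to the source."""
    asset = extract_asset(x_ref, mask)                    # predicted standardized asset
    x_rec = paste_back(asset, x_ref * (1 - mask), mask)   # inverse task: reattach asset
    # Agreement inside the masked region rewards faithful, non-hallucinated extractions.
    return F.mse_loss(x_rec * mask, x_ref * mask)
```

In this reading, the reward model stays frozen and only supplies a gradient signal that penalizes extractions which cannot be reattached consistently to the reference image.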

Method Overview


Overview of the AssetDropper framework. For the input reference image xr, we obtain the asset's mask m through Grounding DINO, and from the mask we further obtain the masked image xm and the edge map xe of the asset. The corresponding standardized asset is denoted by xa. Networks with the same color represent the same model (blue for SDXL UNet, red for SDXL-Inpainting UNet). We first use FeatureNet, a UNet that encodes the low-level features of the inputs xr, m, and xe. We then use IP-Adapter to encode the high-level semantics of the masked image xm. ExtractNet is the main UNet that serves as our generator, processing the noisy data xa,t. A detailed text prompt of the asset is provided by GPT-4o to both FeatureNet and ExtractNet, where [V] denotes the subject, e.g., "a surreal fusion of cherries and skulls, blending natural and macabre elements." We train a model with the same architecture as AssetDropperNet to perform the inverse task, reattaching the extracted asset to the reference image while masking the inpainted area. This model is then used as our pre-trained reward model.
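The sketch below illustrates how the conditioning inputs in the figure could be assembled and fed to the two UNets during one denoising step. The module names FeatureNet and ExtractNet follow the caption, but the constructors and helpers (`segment_asset`, `edge_map`, `encode_prompt`, `ip_adapter_embed`) are hypothetical stand-ins for the released code.

```python
import torch

def prepare_and_denoise(x_r, x_a_t, t, segment_asset, edge_map,
                        encode_prompt, ip_adapter_embed,
                        feature_net, extract_net):
    """One denoising forward pass of ExtractNet under FeatureNet conditioning."""
    m   = segment_asset(x_r)          # asset mask m, e.g. from Grounding DINO
    x_m = x_r * m                     # masked reference image xm
    x_e = edge_map(x_r) * m           # edge map xe of the selected asset

    # Detailed text prompt describing the subject [V], e.g. produced by GPT-4o.
    text_emb = encode_prompt("a standardized front view of [V]")

    # FeatureNet (SDXL UNet) encodes low-level features of xr, m, and xe;
    # its intermediate features condition the generator.
    low_level_feats = feature_net(torch.cat([x_r, m, x_e], dim=1), t, text_emb)

    # IP-Adapter supplies high-level semantics of the masked image xm.
    image_emb = ip_adapter_embed(x_m)

    # ExtractNet (the main UNet) denoises the noisy asset xa,t under both
    # low-level and high-level conditioning.
    noise_pred = extract_net(x_a_t, t, text_emb, image_emb, low_level_feats)
    return noise_pred
```

The frozen reward model described above would wrap this generator in a paste-back loss, as in the earlier sketch, so that the extracted asset remains consistent with the reference image.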

Video

Results


Qualitative results on in-the-wild images. For each image block, the first row is the input reference image, and the second row is the output of AssetDropper.

Comparisons


Qualitative comparison on the SAP synthetic test dataset. For the synthetic test dataset, we selected three different meshes with varying levels of surface curvature. Our method achieves asset extraction results that are close to the ground truth.