MultiRef: Controllable Image Generation with Multiple Visual References

Ruoxi Chen², Dongping Chen¹², Siyuan Wu², Sinan Wang², Shiyun Lang², Petr Sushku¹, Gaoyang Jiang², Yao Wan², Ranjay Krishna¹
¹University of Washington
²Huazhong University of Science and Technology

Figure: MultiRef Benchmark Overview

**MultiRef** introduces the first comprehensive benchmark for evaluating image generation models' ability to combine multiple visual references, revealing that even state-of-the-art models struggle significantly with multi-reference conditioning compared to single-reference tasks.
  1. Novel Task and Benchmark Creation: The paper introduces MultiRef-Bench, comprising 990 synthetic and 1,000 real-world generation samples that require incorporating visual content from multiple reference images. The benchmark includes a sophisticated synthetic data engine (RefBlend) that generates diverse training samples across 10 reference types (depth maps, sketches, masks, etc.) and 33 reference combinations, with compatibility rules to ensure meaningful combinations.
  2. Comprehensive Evaluation Framework: The authors develop a robust evaluation methodology combining rule-based metrics (IoU for spatial references, MSE for structural references) and a fine-tuned MLLM-as-a-Judge model for semantic assessments. Testing reveals that the best-performing model (OmniGen) achieves only 66.6% on synthetic samples and 79.0% on real-world cases compared to golden answers, exposing significant gaps in current models' multi-reference capabilities.
  3. Critical Findings on Model Limitations: While compositional frameworks (LLM + diffusion models) excel in image quality, they fail to maintain consistency with source images and instructions. Unified models show better controllability but struggle with generation quality. Most critically, all models face severe challenges when visual references are mixed in complex ways, often producing corrupted outputs, indicating that current architectures are not truly equipped for the multi-reference creative processes inherent to human artistic expression.

Abstract

Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs: either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-Bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world generation samples that require incorporating visual content from multiple reference images. The synthetic samples are generated by our data engine RefBlend, covering 10 reference types and 33 reference combinations. Based on RefBlend, we further construct MultiRef, a dataset of 38k high-quality images, to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model, OmniGen, achieving only 66.6% on synthetic samples and 79.0% on real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration.

MultiRef Dataset Construction

We introduce **MultiRef**, a comprehensive dataset for multi-reference image generation, comprising 38,076 high-quality images designed to facilitate controllable image generation using multiple visual references. The dataset encompasses diverse reference types including depth maps, semantic segmentation, Canny edges, sketches, bounding boxes, masks, human poses, art styles, subjects, and textual captions, with 33 different reference combinations across 10 distinct reference modalities. These references span various visual domains from architectural photography and artistic paintings to human-centric imagery and abstract compositions, enabling comprehensive evaluation of multi-reference conditioning capabilities. The construction of the MultiRef dataset follows a dual-pipeline methodology:
  1. Real-World Data Collection from Community Platforms: We collect authentic multi-reference tasks from Reddit's r/PhotoshopRequest community, following established practices for real-world image editing scenarios. We gather 2,300 user-submitted queries that require combining multiple input images, then perform rigorous manual review verifying image necessity, instruction coherence, and output accuracy to ensure data quality. Each datapoint includes 2-6 input images, structured instructions, and a professional-quality output image, and the final selection yields 1,000 high-quality real-world examples.
  2. Synthetic Data Generation via RefBlend Engine: To address the scarcity of multi-reference training data, we develop RefBlend, a novel four-step synthetic data engine that automatically produces diverse samples. The process includes: (1) extracting comprehensive visual references from original images using state-of-the-art models (e.g., Grounded SAM2, Depth Anything2), (2) establishing compatibility rules and generating valid reference combinations based on visual reference dependencies, (3) creating both structured and enhanced instruction prompts using template-based approaches and GPT-4o persona-driven variations, and (4) implementing a high-quality filtering pipeline combining rule-based metrics and fine-tuned MLLM-as-a-Judge assessment to ensure only the most relevant and effective examples are included in the final dataset.
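
To make step (2) above concrete, the sketch below shows one way compatibility rules could be used to enumerate valid reference combinations. The reference-type list mirrors the modalities described above, but the incompatibility set and combination sizes are illustrative assumptions, not RefBlend's actual rules.

```python
# Sketch of compatibility-constrained combination generation (illustrative only).
from itertools import combinations

REFERENCE_TYPES = ["depth", "canny", "sketch", "semantic_map", "bbox",
                   "mask", "pose", "style", "subject", "caption"]

# Hypothetical incompatibilities, e.g., structural signals that would be redundant together.
INCOMPATIBLE = {frozenset({"canny", "sketch"}), frozenset({"depth", "canny"})}

def valid_combinations(min_refs: int = 2, max_refs: int = 4):
    """Yield reference subsets whose every pair satisfies the compatibility rules."""
    for k in range(min_refs, max_refs + 1):
        for combo in combinations(REFERENCE_TYPES, k):
            pairs = (frozenset(p) for p in combinations(combo, 2))
            if all(p not in INCOMPATIBLE for p in pairs):
                yield combo

print(sum(1 for _ in valid_combinations()))  # count of rule-satisfying combinations
```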

Benchmark

We conduct evaluations on three state-of-the-art unified image generation models: OmniGen, ACE, and Show-o. Additionally, we evaluate six compositional frameworks that leverage large language models as perceptors combined with diffusion models as generators, including ChatDiT, Claude + SD (versions 2.1, 3, 3.5), and Gemini + SD (versions 2.1, 3, 3.5).

For multi-reference conditioning, unified models implement multi-turn dialogues where each conversational turn incorporates one reference image, while compositional frameworks process all references simultaneously through structured prompts.
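
A minimal sketch of these two conditioning strategies is given below. The `unified_model.chat` interface and the prompt layout are hypothetical placeholders for illustration, not the actual OmniGen, ACE, Show-o, or ChatDiT APIs.

```python
# Sketch of multi-turn vs. single structured-prompt conditioning (placeholders only).

def multi_turn_condition(unified_model, references: list[dict], instruction: str):
    """Unified models: feed one reference image per conversational turn."""
    history = []
    for ref in references:  # e.g., {"type": "depth", "image": depth_image}
        history = unified_model.chat(
            history, text=f"Here is the {ref['type']} reference.", image=ref["image"]
        )
    return unified_model.chat(history, text=instruction)  # final turn triggers generation

def structured_prompt(references: list[dict], instruction: str) -> str:
    """Compositional frameworks: pack all references into a single structured prompt."""
    lines = [f"<reference {i}: {r['type']}> <image_{i}>" for i, r in enumerate(references, 1)]
    return "\n".join(lines + [f"Task: {instruction}"])
```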

Our evaluation employs a comprehensive three-dimensional assessment framework: (1) Reference Fidelity, measuring how accurately generated images preserve specific attributes from provided references using both rule-based metrics (IoU, MSE) and model-based assessments; (2) Image Quality, evaluating visual fidelity through FID scores and aesthetic appeal via CLIP aesthetic scores; and (3) Overall Assessment, utilizing fine-tuned MLLM-as-a-Judge (GPT-4o-mini) to holistically evaluate Image Quality (IQ), Instruction Following (IF), and Source Fidelity (SF) on a 1-5 scale.
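
As a rough illustration of the Overall Assessment step, the sketch below queries an OpenAI-compatible endpoint for 1-5 ratings on IQ, IF, and SF and rescales them to [0, 1]. The prompt wording, the use of off-the-shelf `gpt-4o-mini` in place of the fine-tuned judge, and the 1-5 to [0, 1] mapping are assumptions for illustration.

```python
# Sketch of an MLLM-as-a-Judge call (assumed prompt and normalization).
import base64, json
from openai import OpenAI

JUDGE_PROMPT = (
    "Rate the last image (the generated result) against the instruction and the "
    "preceding reference images on a 1-5 scale for Image Quality (IQ), Instruction "
    'Following (IF), and Source Fidelity (SF). Reply as JSON: {"IQ": int, "IF": int, "SF": int}.'
)

def encode_image(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def judge(generated: str, references: list[str], instruction: str) -> dict:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{"type": "text", "text": f"{JUDGE_PROMPT}\nInstruction: {instruction}"}]
    content += [encode_image(p) for p in references] + [encode_image(generated)]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for the paper's fine-tuned judge
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(resp.choices[0].message.content)
    return {k: (v - 1) / 4 for k, v in scores.items()}  # assumed mapping of 1-5 to [0, 1]
```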

For reference-specific evaluation, we employ specialized metrics tailored to each modality: IoU for spatial constraints (bounding boxes, semantic maps, masks), MSE for structural fidelity (depth maps, Canny edges, sketches), CLIP scores for semantic consistency (subjects, art styles), and mAP for pose accuracy. All metrics are normalized to [0,1] range for consistency.
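
To illustrate, the sketch below computes two of these rule-based metrics and maps them into [0, 1]; the specific normalization for MSE (one minus a clipped error) is an assumption rather than the benchmark's exact formula.

```python
# Sketch of rule-based reference-fidelity scoring, normalized to [0, 1].
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU for binary masks (spatial references such as masks or rasterized boxes)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def structural_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Turn MSE on [0, 1]-valued maps (depth, Canny edges, sketches) into a 0-1 score."""
    mse = float(np.mean((pred.astype(np.float32) - gt.astype(np.float32)) ** 2))
    return 1.0 - min(mse, 1.0)  # lower error -> higher score

# Toy 2x2 examples
print(mask_iou(np.array([[1, 0], [1, 1]]), np.array([[1, 0], [0, 1]])))   # ~0.667
print(structural_score(np.zeros((2, 2)), np.full((2, 2), 0.5)))           # 0.75
```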

We evaluate on both synthetic samples (990 examples) and real-world tasks (1,000 examples) to ensure comprehensive assessment across diverse multi-reference scenarios.

| Model | IQ (Real-world) | IF (Real-world) | SF (Real-world) | IQ (Synthetic) | IF (Synthetic) | SF (Synthetic) |
|---|---|---|---|---|---|---|
| Gemini-2.0-Flash | 0.385 | 0.422 | 0.354 | 0.369 | 0.627 | 0.588 |
| GPT-4o-mini | 0.466 | 0.530 | 0.514 | 0.438 | 0.632 | 0.616 |
| GPT-4o | 0.432 | 0.624 | 0.613 | 0.406 | 0.668 | 0.659 |
| Human-Human | 0.589 | 0.665 | 0.571 | 0.629 | 0.721 | 0.694 |

Empirical Results

Compositional frameworks excel in image quality but fail to maintain consistency on real-world cases.

LLM + SD combinations achieve the highest image quality scores, with Claude + SD3.5 reaching 0.774 and occasionally surpassing the ground truth. However, all compositional frameworks consistently underperform in instruction following and source fidelity. While the ground truth achieves 0.767 (IF) and 0.706 (SF), Claude + SD3.5 reaches only 0.589 and 0.462, indicating that the separated perceptor-generator architecture fundamentally compromises complex visual instruction execution.

Unified models struggle with generation quality and with handling real-world images.

Although the end-to-end design of unified models should in theory help maintain consistency, they underperform in fidelity preservation. OmniGen's performance on various metrics even approaches that of some compositional frameworks that generate images with state-of-the-art diffusion models, demonstrating its effectiveness in balancing quality with instruction adherence. However, all models still fall short of the golden answer (created with professional software), highlighting significant room for improvement in real-world image generation scenarios.

Controllable image generation from multiple references is challenging.

Even advanced models like ACE, despite strong performance in specific areas (Bbox: 0.219, Pose: 0.090), show substantial gaps in reference fidelity compared to Ground Truth. While unified end-to-end architectures offer greater potential than compositional frameworks, both struggle with complex reference combinations or image generation without captions, highlighting the need for improved generalization in multi-image generation.

Models show strong and varied preferences for reference formats.

Our ablation study investigates how different input formats for Bounding Box (BBox), Depth, and Mask conditions affect the generation performance of three models. There is no universally superior format across all tested models; instead, each model exhibits a distinct preference. ACE and ChatDiT show more robust performance across the depth and mask formats. For Depth MSE, ACE performs significantly better with “ori depth”, whereas OmniGen and ChatDiT show slightly better or comparable performance with “color depth”.
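
For context on the depth-format ablation, the snippet below shows one common way a single-channel ("ori") depth map can be converted into a colormapped RGB ("color") depth image. The colormap choice and min-max normalization are assumptions, not the exact preprocessing used in our experiments.

```python
# Sketch: converting "ori depth" (single channel) into "color depth" (RGB).
import numpy as np
from matplotlib import colormaps

def to_color_depth(depth: np.ndarray, cmap: str = "inferno") -> np.ndarray:
    """Map an HxW depth array to an HxWx3 uint8 RGB image."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # min-max normalize to [0, 1]
    rgba = colormaps[cmap](d)                       # HxWx4 floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)   # drop alpha, scale to uint8

depth = np.linspace(0.0, 10.0, 16).reshape(4, 4)    # toy depth map
print(to_color_depth(depth).shape)                  # (4, 4, 3)
```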

Input order primarily influences specific conditional fidelities rather than global image quality.

Switching the input order resulted in only minor FID improvements or no change for the models. However, it had more substantial and often model-specific impacts on adherence to particular conditions. For example, ACE's depth error and sketch error both increased dramatically when the order was switched. These observations suggest that the sequence in which conditions are processed matters more for controlling specific visual attributes than for overall image realism as measured by FID. Beyond ordering, the presence of captions also improves depth fidelity and aesthetic quality across all models; for instance, ACE's depth error increases significantly when captions are removed. However, for semantic map fidelity, OmniGen and ACE perform better without captions. Similarly, sketch fidelity improves for all three models when captions are absent, with ACE showing a notable reduction in sketch error.

Acknowledgement

Many thanks to Jieyu Zhang for his invaluable contributions to this project.

BibTeX

MultiRef Team