MultiRef: Controllable Image Generation with Multiple Visual References

Ruoxi Chen², Dongping Chen¹², Siyuan Wu², Sinan Wang², Shiyun Lang², Petr Sushku¹, Gaoyang Jiang², Yao Wan², Ranjay Krishna¹
¹University of Washington
²Huazhong University of Science and Technology

Figure: MultiRef Benchmark Overview

**MultiRef** introduces the first comprehensive benchmark for evaluating image generation models' ability to combine multiple visual references, revealing that even state-of-the-art models struggle significantly with multi-reference conditioning compared to single-reference tasks.
  1. Novel Task and Benchmark Creation: The paper introduces MultiRef-Bench, comprising 990 synthetic and 1,000 real-world generation samples that require incorporating visual content from multiple reference images. The benchmark includes a sophisticated synthetic data engine (RefBlend) that generates diverse training samples across 10 reference types (depth maps, sketches, masks, etc.) and 33 reference combinations, with compatibility rules to ensure meaningful combinations.
  2. Comprehensive Evaluation Framework: The authors develop a robust evaluation methodology combining rule-based metrics (IoU for spatial references, MSE for structural references) and a fine-tuned MLLM-as-a-Judge model for semantic assessments. Testing reveals that the best-performing model (OmniGen) achieves only 66.6% on synthetic samples and 79.0% on real-world cases compared to golden answers, exposing significant gaps in current models' multi-reference capabilities.
  3. Critical Findings on Model Limitations: While compositional frameworks (LLM + diffusion models) excel in image quality, they fail to maintain consistency with source images and instructions. Unified models show better controllability but struggle with generation quality. Most critically, all models face severe challenges when visual references are mixed in complex ways, often producing corrupted outputs, indicating that current architectures are not truly equipped for the multi-reference creative processes inherent to human artistic expression.

Abstract

Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs: either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-Bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world generation samples that require incorporating visual content from multiple reference images. The synthetic samples are generated by our data engine RefBlend, covering 10 reference types and 33 reference combinations. Based on RefBlend, we further construct MultiRef, a dataset of 38k high-quality images, to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model, OmniGen, achieving only 66.6% on synthetic samples and 79.0% on real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration.

MultiRef Dataset Construction

We introduce **MultiRef**, a comprehensive dataset for multi-reference image generation, comprising 38,076 high-quality images designed to facilitate controllable image generation using multiple visual references. The dataset encompasses diverse reference types including depth maps, semantic segmentation, Canny edges, sketches, bounding boxes, masks, human poses, art styles, subjects, and textual captions, with 33 different reference combinations across 10 distinct reference modalities. These references span various visual domains from architectural photography and artistic paintings to human-centric imagery and abstract compositions, enabling comprehensive evaluation of multi-reference conditioning capabilities. The construction of the MultiRef dataset follows a dual-pipeline methodology:
  1. Real-World Data Collection from Community Platforms: We collect authentic multi-reference tasks from Reddit's r/PhotoshopRequest community, following established practices for real-world image editing scenarios. We gather 2,300 user-submitted queries that require combining multiple input images, then perform rigorous manual review verifying image necessity, instruction coherence, and output accuracy to ensure data quality. Each datapoint includes 2-6 input images, structured instructions, and a professional-quality output image, and the final selection yields 1,000 high-quality real-world examples.
  2. Synthetic Data Generation via RefBlend Engine: To address the scarcity of multi-reference training data, we develop RefBlend, a novel four-step synthetic data engine that automatically produces diverse samples. The process includes: (1) extracting comprehensive visual references from original images using state-of-the-art models (e.g., Grounded SAM2, Depth Anything2), (2) establishing compatibility rules and generating valid reference combinations based on visual reference dependencies, (3) creating both structured and enhanced instruction prompts using template-based approaches and GPT-4o persona-driven variations, and (4) implementing a high-quality filtering pipeline combining rule-based metrics and fine-tuned MLLM-as-a-Judge assessment to ensure only the most relevant and effective examples are included in the final dataset.
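
To make step (2) above concrete, the sketch below shows one way compatibility rules could be used to enumerate valid reference combinations. The reference-type list mirrors the modalities described above, but the incompatibility set and combination sizes are illustrative assumptions, not RefBlend's actual rules.

```python
# Sketch of compatibility-constrained combination generation (illustrative only).
from itertools import combinations

REFERENCE_TYPES = ["depth", "canny", "sketch", "semantic_map", "bbox",
                   "mask", "pose", "style", "subject", "caption"]

# Hypothetical incompatibilities, e.g., structural signals that would be redundant together.
INCOMPATIBLE = {frozenset({"canny", "sketch"}), frozenset({"depth", "canny"})}

def valid_combinations(min_refs: int = 2, max_refs: int = 4):
    """Yield reference subsets whose every pair satisfies the compatibility rules."""
    for k in range(min_refs, max_refs + 1):
        for combo in combinations(REFERENCE_TYPES, k):
            pairs = (frozenset(p) for p in combinations(combo, 2))
            if all(p not in INCOMPATIBLE for p in pairs):
                yield combo

print(sum(1 for _ in valid_combinations()))  # count of rule-satisfying combinations
```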

Benchmark

We conduct evaluations on three state-of-the-art unified image generation models: OmniGen, ACE, and Show-o. Additionally, we evaluate six compositional frameworks that leverage large language models as perceptors combined with diffusion models as generators, including ChatDiT, Claude + SD (versions 2.1, 3, 3.5), and Gemini + SD (versions 2.1, 3, 3.5).

For multi-reference conditioning, unified models implement multi-turn dialogues where each conversational turn incorporates one reference image, while compositional frameworks process all references simultaneously through structured prompts.
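
A minimal sketch of these two conditioning strategies is given below. The `unified_model.chat` interface and the prompt layout are hypothetical placeholders for illustration, not the actual OmniGen, ACE, Show-o, or ChatDiT APIs.

```python
# Sketch of multi-turn vs. single structured-prompt conditioning (placeholders only).

def multi_turn_condition(unified_model, references: list[dict], instruction: str):
    """Unified models: feed one reference image per conversational turn."""
    history = []
    for ref in references:  # e.g., {"type": "depth", "image": depth_image}
        history = unified_model.chat(
            history, text=f"Here is the {ref['type']} reference.", image=ref["image"]
        )
    return unified_model.chat(history, text=instruction)  # final turn triggers generation

def structured_prompt(references: list[dict], instruction: str) -> str:
    """Compositional frameworks: pack all references into a single structured prompt."""
    lines = [f"<reference {i}: {r['type']}> <image_{i}>" for i, r in enumerate(references, 1)]
    return "\n".join(lines + [f"Task: {instruction}"])
```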

Our evaluation employs a comprehensive three-dimensional assessment framework: (1) Reference Fidelity, measuring how accurately generated images preserve specific attributes from provided references using both rule-based metrics (IoU, MSE) and model-based assessments; (2) Image Quality, evaluating visual fidelity through FID scores and aesthetic appeal via CLIP aesthetic scores; and (3) Overall Assessment, utilizing fine-tuned MLLM-as-a-Judge (GPT-4o-mini) to holistically evaluate Image Quality (IQ), Instruction Following (IF), and Source Fidelity (SF) on a 1-5 scale.
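
As a rough illustration of the Overall Assessment step, the sketch below queries an OpenAI-compatible endpoint for 1-5 ratings on IQ, IF, and SF and rescales them to [0, 1]. The prompt wording, the use of off-the-shelf `gpt-4o-mini` in place of the fine-tuned judge, and the 1-5 to [0, 1] mapping are assumptions for illustration.

```python
# Sketch of an MLLM-as-a-Judge call (assumed prompt and normalization).
import base64, json
from openai import OpenAI

JUDGE_PROMPT = (
    "Rate the last image (the generated result) against the instruction and the "
    "preceding reference images on a 1-5 scale for Image Quality (IQ), Instruction "
    'Following (IF), and Source Fidelity (SF). Reply as JSON: {"IQ": int, "IF": int, "SF": int}.'
)

def encode_image(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def judge(generated: str, references: list[str], instruction: str) -> dict:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{"type": "text", "text": f"{JUDGE_PROMPT}\nInstruction: {instruction}"}]
    content += [encode_image(p) for p in references] + [encode_image(generated)]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for the paper's fine-tuned judge
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(resp.choices[0].message.content)
    return {k: (v - 1) / 4 for k, v in scores.items()}  # assumed mapping of 1-5 to [0, 1]
```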

For reference-specific evaluation, we employ specialized metrics tailored to each modality: IoU for spatial constraints (bounding boxes, semantic maps, masks), MSE for structural fidelity (depth maps, Canny edges, sketches), CLIP scores for semantic consistency (subjects, art styles), and mAP for pose accuracy. All metrics are normalized to [0,1] range for consistency.
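
To illustrate, the sketch below computes two of these rule-based metrics and maps them into [0, 1]; the specific normalization for MSE (one minus a clipped error) is an assumption rather than the benchmark's exact formula.

```python
# Sketch of rule-based reference-fidelity scoring, normalized to [0, 1].
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU for binary masks (spatial references such as masks or rasterized boxes)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def structural_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Turn MSE on [0, 1]-valued maps (depth, Canny edges, sketches) into a 0-1 score."""
    mse = float(np.mean((pred.astype(np.float32) - gt.astype(np.float32)) ** 2))
    return 1.0 - min(mse, 1.0)  # lower error -> higher score

# Toy 2x2 examples
print(mask_iou(np.array([[1, 0], [1, 1]]), np.array([[1, 0], [0, 1]])))   # ~0.667
print(structural_score(np.zeros((2, 2)), np.full((2, 2), 0.5)))           # 0.75
```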

We evaluate on both synthetic samples (990 examples) and real-world tasks (1,000 examples) to ensure comprehensive assessment across diverse multi-reference scenarios.

| Model | IQ (Real-world) | IF (Real-world) | SF (Real-world) | IQ (Synthetic) | IF (Synthetic) | SF (Synthetic) |
|---|---|---|---|---|---|---|
| Gemini-2.0-Flash | 0.385 | 0.422 | 0.354 | 0.369 | 0.627 | 0.588 |
| GPT-4o-mini | 0.466 | 0.530 | 0.514 | 0.438 | 0.632 | 0.616 |
| GPT-4o | 0.432 | 0.624 | 0.613 | 0.406 | 0.668 | 0.659 |
| Human-Human | 0.589 | 0.665 | 0.571 | 0.629 | 0.721 | 0.694 |

Empirical Results

Compositional frameworks excel in image quality but fail to maintain consistency on real-world cases.

LLM + SD combinations achieve the highest image quality scores, with Claude + SD3.5 reaching 0.774 and occasionally surpassing the ground truth. However, all compositional frameworks consistently underperform in instruction following and source fidelity. While the ground truth achieves 0.767 (IF) and 0.706 (SF), Claude + SD3.5 reaches only 0.589 and 0.462, indicating that the separated perceptor-generator architecture fundamentally compromises complex visual instruction execution.

Unified models struggle with generation quality and with handling real-world images.

Although the end-to-end design of unified models should in theory help maintain consistency, they underperform in fidelity preservation. OmniGen's performance on various metrics even approaches that of some compositional frameworks that generate images with state-of-the-art diffusion models, demonstrating its effectiveness in balancing quality with instruction adherence. However, all models still fall short of the golden answer (created with professional software), highlighting significant room for improvement in real-world image generation scenarios.

Controllable image generation from multiple references is challenging.

Even advanced models like ACE, despite strong performance in specific areas (Bbox: 0.219, Pose: 0.090), show substantial gaps in reference fidelity compared to Ground Truth. While unified end-to-end architectures offer greater potential than compositional frameworks, both struggle with complex reference combinations or image generation without captions, highlighting the need for improved generalization in multi-image generation.

Models show strong and varied preferences for reference formats.

Our ablation study investigates how different input formats for Bounding Box (BBox), Depth, and Mask conditions affect the generation performance of three models. There is no universally superior format across all tested models; instead, each model exhibits a distinct preference. ACE and ChatDiT show more robust performance across the depth and mask formats. For Depth MSE, ACE performs significantly better with “ori depth”, whereas OmniGen and ChatDiT show slightly better or comparable performance with “color depth”.
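
For context on the depth-format ablation, the snippet below shows one common way a single-channel ("ori") depth map can be converted into a colormapped RGB ("color") depth image. The colormap choice and min-max normalization are assumptions, not the exact preprocessing used in our experiments.

```python
# Sketch: converting "ori depth" (single channel) into "color depth" (RGB).
import numpy as np
from matplotlib import colormaps

def to_color_depth(depth: np.ndarray, cmap: str = "inferno") -> np.ndarray:
    """Map an HxW depth array to an HxWx3 uint8 RGB image."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # min-max normalize to [0, 1]
    rgba = colormaps[cmap](d)                       # HxWx4 floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)   # drop alpha, scale to uint8

depth = np.linspace(0.0, 10.0, 16).reshape(4, 4)    # toy depth map
print(to_color_depth(depth).shape)                  # (4, 4, 3)
```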

Input order primarily influences specific conditional fidelities rather than global image quality.

Switching the input order resulted in only minor FID improvements or no change for the models. However, it had more substantial and often model-specific impacts on adherence to particular conditions. For example, ACE's depth error and sketch error both increased dramatically when the order was switched. These observations suggest that the sequence in which conditions are processed matters more for controlling specific visual attributes than for overall image realism as measured by FID. Beyond ordering, the presence of captions also improves depth fidelity and aesthetic quality across all models; for instance, ACE's depth error increases significantly when captions are removed. However, for semantic map fidelity, OmniGen and ACE perform better without captions. Similarly, sketch fidelity improves for all three models when captions are absent, with ACE showing a notable reduction in sketch error.

Acknowledgement

Many thanks to Jieyu Zhang for his invaluable contributions to this project.

BibTeX

MultiRef Team