Model | Real-world IQ | Real-world IF | Real-world SF | Synthetic IQ | Synthetic IF | Synthetic SF |
---|---|---|---|---|---|---|
Gemini-2.0-Flash | 0.385 | 0.422 | 0.354 | 0.369 | 0.627 | 0.588 |
GPT-4o-mini | 0.466 | 0.530 | 0.514 | 0.438 | 0.632 | 0.616 |
GPT-4o | 0.432 | 0.624 | 0.613 | 0.406 | 0.668 | 0.659 |
Human-Human | 0.589 | 0.665 | 0.571 | 0.629 | 0.721 | 0.694 |
LLM+SD combinations achieve the highest image quality scores, with Claude + SD3.5 reaching 0.774 and occasionally surpassing the ground truth. However, all compositional frameworks consistently underperform in instruction following and source fidelity: while the ground truth achieves 0.767 (IF) and 0.706 (SF), Claude + SD3.5 reaches only 0.589 and 0.462, indicating that the separated perceptor-generator architecture fundamentally compromises the execution of complex visual instructions.
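For illustration, the sketch below shows the general shape of such a separated perceptor-generator pipeline: an LLM first verbalizes the references and instruction into a single text prompt, which a text-to-image diffusion model then renders. The `describe_references` helper and the specific checkpoint name are illustrative assumptions, not the exact setup used in our experiments.

```python
# Hypothetical sketch of a compositional (perceptor-generator) framework:
# an LLM "perceptor" summarizes the reference images and instruction into a
# text prompt, and a diffusion "generator" renders the final image from text
# only. Model names and describe_references are illustrative assumptions.
import torch
from diffusers import DiffusionPipeline


def describe_references(instruction: str, reference_paths: list[str]) -> str:
    """Placeholder for the LLM perceptor: in practice this would call a
    multimodal LLM with the reference images attached and return a single
    text prompt describing the desired output image."""
    return f"{instruction} (consistent with {len(reference_paths)} reference images)"


# Generator: any text-to-image diffusion model; SD3.5 is shown as one example.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

prompt = describe_references(
    "Place the red chair from image 1 into the living room of image 2",
    ["ref1.png", "ref2.png"],
)
image = pipe(prompt=prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
image.save("composed.png")
```

Because the generator only ever sees text, any visual detail the perceptor fails to verbalize is lost, which is consistent with the source-fidelity gap noted above.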
Although the end-to-end design of unified models should, in theory, help maintain consistency, they underperform in fidelity preservation. OmniGen's performance on various metrics even approaches that of compositional frameworks that generate images with state-of-the-art diffusion models, demonstrating its effectiveness in balancing quality with instruction adherence. However, all models still fall short of the golden answers (created with professional software), highlighting significant room for improvement in real-world image generation scenarios.
Even advanced models like ACE, despite strong performance in specific areas (BBox: 0.219, Pose: 0.090), show substantial gaps in reference fidelity compared to the ground truth. While unified end-to-end architectures offer greater potential than compositional frameworks, both struggle with complex reference combinations and with image generation without captions, highlighting the need for improved generalization in multi-image generation.
Our ablation study investigates how different input formats for the Bounding Box (BBox), Depth, and Mask conditions affect the generation performance of three models. No single format is universally superior across all tested models; instead, each model often exhibits a distinct preference. ACE and ChatDiT are more robust to the choice of depth and mask format. For Depth MSE, ACE performs significantly better with "ori depth", whereas OmniGen and ChatDiT show slightly better or comparable performance with "color depth".
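For a concrete reading of the Depth MSE numbers, the following is a minimal sketch of how such a fidelity score can be computed, assuming both depth maps are normalized to [0, 1] and the generated image's depth is re-estimated with an off-the-shelf monocular depth estimator; the exact estimator and normalization used in our evaluation may differ.

```python
# Hypothetical sketch of a Depth-MSE style fidelity metric: compare the depth
# condition given to the model against depth re-estimated from the generated
# image. The [0, 1] normalization and the choice of depth estimator are
# assumptions; the benchmark's exact protocol may differ.
import numpy as np


def normalize(depth: np.ndarray) -> np.ndarray:
    """Rescale a depth map to [0, 1] so maps from different sources are comparable."""
    d = depth.astype(np.float64)
    return (d - d.min()) / (d.max() - d.min() + 1e-8)


def depth_mse(condition_depth: np.ndarray, generated_depth: np.ndarray) -> float:
    """Mean squared error between the conditioning depth map and the depth
    re-estimated from the generated image (both resized to the same shape)."""
    a, b = normalize(condition_depth), normalize(generated_depth)
    assert a.shape == b.shape, "resize one map to match the other first"
    return float(np.mean((a - b) ** 2))
```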
Switching the input order resulted in only minor FID improvements, or no change, for the models. However, this operation had more substantial and often model-specific impacts on adherence to particular conditions; for example, ACE's depth error and sketch error both increased dramatically when the order was switched. These observations suggest that the sequence in which conditions are processed is more critical for controlling specific visual attributes than for overall image realism as measured by FID.

The presence of captions also improves depth fidelity and aesthetic quality across all models; for instance, ACE's depth error increases significantly when captions are removed. For semantic map fidelity, however, OmniGen and ACE perform better without captions. Similarly, sketch fidelity improves for all three models when captions are absent, with ACE showing a notable reduction in sketch error.
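As a reference for the realism metric used above, the following is a minimal sketch of an FID computation with torchmetrics; this is an assumption about tooling, and any standard FID implementation over the same Inception features would serve equally well.

```python
# Hypothetical sketch of computing FID between reference and generated images
# with torchmetrics; the benchmark may use a different FID implementation.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# feature=64 selects a small InceptionV3 feature layer, chosen here only to
# keep this smoke test stable with a handful of random images.
fid = FrechetInceptionDistance(feature=64)

# Images must be uint8 tensors of shape (N, 3, H, W) with values in [0, 255].
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute().item():.2f}")
```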