Representation matters for generation. But what truly drives its effectiveness: global information or spatial structure? Prevailing wisdom says global information. We reveal a surprising finding: it is spatial structure, not global information, that drives a representation's generation performance.
Higher global information does not mean better REPA performance. Target representations with over 60 points higher ImageNet accuracy can underperform for generation.
Spatial structure shows higher correlation with generation quality. Spatial metrics achieve |r| > 0.85 correlation with FID, while ImageNet accuracy shows only |r| = 0.26.
Accentuating transfer of spatial information consistently improves convergence speed. Our simple method (iREPA) improves REPA across diverse encoders, model sizes, and training recipes.
Representation alignment (REPA) accelerates diffusion model training by distilling knowledge from pretrained vision encoders to intermediate diffusion features. A fundamental question persists: what aspect of the target representation drives generation performance – its global semantic information (measured by ImageNet-1K accuracy) or its spatial structure (pairwise similarity between patch tokens)?
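For concreteness, here is a minimal sketch of a REPA-style alignment objective: project intermediate diffusion tokens and maximize patch-wise cosine similarity with the frozen encoder's patch tokens. The function and variable names are ours, not those of the reference implementation.

```python
import torch
import torch.nn.functional as F

def repa_alignment_loss(diffusion_feats, encoder_feats, proj):
    """REPA-style alignment: project intermediate diffusion features and
    maximize patch-wise cosine similarity with frozen encoder features.

    diffusion_feats: (B, N, C_dit)  tokens from an intermediate SiT/DiT block
    encoder_feats:   (B, N, C_enc)  patch tokens from a frozen vision encoder
    proj:            module mapping C_dit -> C_enc (an MLP in baseline REPA)
    """
    pred = proj(diffusion_feats)                             # (B, N, C_enc)
    cos = F.cosine_similarity(pred, encoder_feats, dim=-1)   # (B, N)
    return -cos.mean()                                       # higher similarity = lower loss
```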
Prevailing wisdom holds that performance of a target representation for generation is heavily tied to its global semantic performance (e.g., ImageNet-1K accuracy).
The prevailing understanding suggests:
"When a diffusion transformer is aligned with a pretrained encoder that offers more semantically meaningful representations (i.e., better linear probing results), the model not only captures better semantics but also exhibits enhanced generation performance, as reflected by improved validation accuracy with linear probing and lower FID scores."
— REPA (Yu et al., 2024)
We challenge this view. Through large-scale empirical analysis across 27 vision encoders and multiple model scales, we uncover three surprising findings:
We identify four key trends in representation alignment that cannot be explained by global accuracy (ImageNet-1K performance). These observations challenge the conventional assumption that better classification accuracy implies better generation with REPA.
Recent vision encoders exhibit a surprising inverse relationship between accuracy and generation quality. PE-Core-G (1.88B params, 82.8% accuracy) performs worse than PE-Spatial-B (80M params, 53.1% accuracy), with FID 32.3 vs 21.0. Similarly, WebSSL-1B (76.0% accuracy) underperforms PE-Spatial-B despite roughly 23 points higher ImageNet accuracy, yielding FID 26.1 vs 21.0.
SAM2-S, with only 24.1% ImageNet accuracy, achieves better generation performance than encoders whose accuracy is roughly 60 points higher. This small model (46M params) outperforms giants like PE-Core-G (82.8% accuracy) when used for REPA, demonstrating that global semantic understanding is not the key driver.
Contrary to expectations, larger model variants within the same encoder family often lead to similar (DINOv2) or worse (PE, C-RADIO) generation performance. Despite having better ImageNet accuracy, these larger models fail to improve—and sometimes harm—generation quality with REPA.
In controlled experiments mixing CLS tokens into patch representations (α ∈ [0, 0.5]), linear probing accuracy improves from 70.7% to 78.5%, yet generation quality deteriorates markedly, with FID worsening from 19.2 to 25.4. This shows that injecting global information actively harms generation performance.
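A minimal sketch of this kind of CLS-mixing intervention is shown below; the exact interpolation used in our experiments may differ, and `alpha` and the tensor layout are illustrative.

```python
import torch

def mix_cls_into_patches(patch_tokens, cls_token, alpha):
    """Blend the global CLS token into every patch token.

    patch_tokens: (B, N, C) spatial tokens from the encoder
    cls_token:    (B, C)    global token from the same encoder
    alpha:        mixing weight in [0, 0.5]; alpha=0 keeps the original patches
    """
    # convex combination: raises linear-probe accuracy, but flattens
    # the spatial contrast between patches
    return (1 - alpha) * patch_tokens + alpha * cls_token.unsqueeze(1)
```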
To quantify spatial structure, we measure the spatial self-similarity between patch tokens, i.e., how the similarity between tokens varies with their spatial distance. We then perform a large-scale correlation analysis across 27 diverse vision encoders.
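As a concrete illustration, here is a minimal sketch of one such self-similarity computation; the metric used in the paper may aggregate this map differently.

```python
import torch
import torch.nn.functional as F

def patch_self_similarity(patch_tokens):
    """Pairwise cosine similarity between patch tokens of one image.

    patch_tokens: (N, C) -> returns an (N, N) self-similarity matrix whose
    structure (how similarity falls off with spatial distance) is what we
    correlate with generation FID across encoders.
    """
    z = F.normalize(patch_tokens, dim=-1)   # unit-norm each token
    return z @ z.T                          # (N, N) cosine similarities
```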
The correlation pattern holds consistently across different model sizes - SiT-B, SiT-L, and SiT-XL.
Surprisingly, yes. Classical spatial features like SIFT, HOG, and intermediate VGG features all lead to performance gains with REPA, providing further evidence that spatial structure alone drives effectiveness.
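As one hedged illustration, intermediate VGG feature maps can be reshaped into patch-token form and used as alignment targets; the layer cut and pooling grid below are assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG16 backbone; an intermediate block output serves as the target.
vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)
backbone = vgg.features[:16].eval()          # up to conv3_3 (layer choice is illustrative)
for p in backbone.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def vgg_target(images, grid=16):
    """images: (B, 3, H, W) -> (B, grid*grid, C) token-like spatial features."""
    fmap = backbone(images)                  # (B, C, h, w) spatial feature map
    fmap = F.adaptive_avg_pool2d(fmap, grid) # pool to a grid x grid token layout
    return fmap.flatten(2).transpose(1, 2)   # (B, grid*grid, C)
```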
Yes. The spatial structure metrics can explain both the gains from baseline REPA and our improved iREPA method.
Building on the insight that spatial structure drives REPA performance, we introduce two straightforward modifications that enhance the transfer of spatial information from the target representation to the diffusion features.
Standard REPA uses a 3-layer MLP to map diffusion features to the target representation's dimension. However, this MLP projection is lossy and diminishes spatial contrast between patch tokens. We replace it with a lightweight convolutional layer (kernel size 3) that naturally preserves local spatial relationships.
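A minimal sketch of such a convolutional projector is given below; module and argument names are ours, and only the 3x3 kernel follows the text.

```python
import torch
import torch.nn as nn

class ConvProjector(nn.Module):
    """Maps diffusion tokens (B, N, C_dit) to the encoder dimension with a 3x3 conv,
    preserving local spatial relationships that a token-wise MLP discards."""

    def __init__(self, dit_dim, enc_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(dit_dim, enc_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, tokens):
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)                          # assume a square patch grid
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.conv(x)                               # (B, enc_dim, h, w)
        return x.flatten(2).transpose(1, 2)            # back to (B, N, enc_dim)
```

Because the convolution mixes each token with its spatial neighbors before alignment, the projected features retain the local contrast that the alignment loss then transfers.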
Patch tokens of pretrained vision encoders contain a global component that limits spatial contrast. This causes tokens in one semantic region (e.g., tomato) to show high similarity with unrelated tokens (e.g., background). We introduce spatial normalization that sacrifices this global information to enhance spatial contrast between patches:
y = (x - 𝔼[x]) / √(Var[x] + ε)
where expectations are computed across the spatial dimension.
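A minimal sketch of this spatial normalization applied to the target patch tokens follows; the (B, N, C) layout is an assumption.

```python
import torch

def spatial_norm(patch_tokens, eps=1e-6):
    """Standardize each channel across the spatial (token) dimension.

    patch_tokens: (B, N, C). Subtracting the per-channel spatial mean removes
    the shared global component, which sharpens contrast between patches.
    """
    mean = patch_tokens.mean(dim=1, keepdim=True)                 # E[x] over patches
    var = patch_tokens.var(dim=1, keepdim=True, unbiased=False)   # Var[x] over patches
    return (patch_tokens - mean) / torch.sqrt(var + eps)
```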
Together, these two modifications significantly enhance the spatial structure of diffusion features, leading to consistently improved convergence speed across diverse encoders and model sizes.
Despite requiring fewer than four lines of code changed, iREPA consistently improves convergence speed across diverse encoders (DINOv2, CLIP, WebSSL, PE-Core), model sizes (SiT-B, SiT-L, SiT-XL), and training recipes, including REPA-E and MeanFlow with REPA.
iREPA consistently achieves faster convergence across diverse vision encoders and model sizes.
iREPA improvements generalize to different training recipes including REPA-E and MeanFlow with REPA.
| Encoder | IS↑ | FID↓ | sFID↓ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|---|
| WebSSL-1B | 52.8 | 26.5 | 5.20 | 0.620 | 0.585 |
| +iREPA-E | 87.0 | 13.2 | 5.28 | 0.699 | 0.598 |
| PE-G | 50.9 | 25.9 | 5.68 | 0.612 | 0.576 |
| +iREPA-E | 80.0 | 16.4 | 5.40 | 0.667 | 0.616 |
| DINOv3-B | 82.2 | 14.4 | 4.68 | 0.694 | 0.596 |
| +iREPA-E | 93.6 | 11.7 | 4.57 | 0.703 | 0.613 |
| Encoder | IS↑ (4 NFE, w/o CFG) | FID↓ (4 NFE, w/o CFG) | IS↑ (1 NFE, w/o CFG) | FID↓ (1 NFE, w/o CFG) | IS↑ (4 NFE, CFG 2.0) | FID↓ (4 NFE, CFG 2.0) | IS↑ (1 NFE, CFG 2.0) | FID↓ (1 NFE, CFG 2.0) |
|---|---|---|---|---|---|---|---|---|
| WebSSL-1B | 27.2 | 51.4 | 24.1 | 58.7 | 87.9 | 16.6 | 69.1 | 23.7 |
| +iREPA | 31.5 | 45.7 | 27.3 | 55.7 | 100.7 | 13.9 | 78.7 | 20.7 |
| DINOv3-B | 28.4 | 49.6 | 25.5 | 57.0 | 93.3 | 15.6 | 72.4 | 22.6 |
| +iREPA | 33.6 | 44.5 | 29.7 | 53.8 | 124.5 | 11.1 | 98.9 | 17.3 |
Both spatial normalization and convolution projection contribute significantly to performance gains.
| Method | DINOv2-B FID↓ | DINOv2-B IS↑ | DINOv2-B sFID↓ | DINOv3-B FID↓ | DINOv3-B IS↑ | DINOv3-B sFID↓ | WebSSL-1B FID↓ | WebSSL-1B IS↑ | WebSSL-1B sFID↓ | PE-Core-G FID↓ | PE-Core-G IS↑ | PE-Core-G sFID↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline REPA | 19.06 | 70.3 | 5.83 | 21.47 | 63.4 | 6.19 | 26.10 | 53.0 | 6.90 | 32.35 | 42.7 | 6.70 |
| iREPA (w/o spatial norm) | 18.52 | 73.3 | 6.11 | 17.76 | 74.7 | 5.81 | 21.17 | 64.6 | 6.27 | 24.97 | 57.4 | 6.21 |
| iREPA (w/o conv proj) | 17.66 | 72.8 | 6.03 | 18.28 | 70.8 | 6.18 | 18.44 | 71.0 | 6.22 | 21.72 | 61.5 | 6.26 |
| iREPA (full) | 16.96 | 77.9 | 6.26 | 16.26 | 78.8 | 6.14 | 16.66 | 77.5 | 6.18 | 18.19 | 75.0 | 6.03 |
We investigate what truly drives the effectiveness of representation alignment: global information or spatial structure of the target representation?
Through large-scale empirical analysis we uncover a surprising finding: spatial structure, not global information, drives the effectiveness of representation alignment.
We study this further by introducing two simple modifications that accentuate the transfer of spatial information from the target representation to the diffusion features.
Our simple method, termed iREPA, consistently improves convergence speed with REPA across diverse variations in vision encoders and training recipes.
We hope our work motivates future research to revisit the fundamental working mechanism of representation alignment and how to better leverage it for improved training of generative models.
@article{singh2025irepa,
title={{What matters for Representation Alignment: Global Information or Spatial Structure?}},
author={Singh, Jaskirat and Leng, Xingjian and Wu, Zongze and Zheng, Liang and Zhang, Richard and Shechtman, Eli and Xie, Saining},
journal={arXiv preprint},
year={2025}
}
[1] Yu, S., Sohn, K., Kim, S., & Shin, J. (2024). Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940.
[2] Peebles, W., & Xie, S. (2023). Scalable diffusion models with transformers. ICCV.
[3] Oquab, M., et al. (2023). DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
[4] Ravi, N., et al. (2024). SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714.
[5] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. CVPR.