We address a fundamental question: can latent diffusion models and their VAE tokenizer be trained end-to-end? Naively training both components jointly with the standard diffusion loss is ineffective, often degrading final performance; we show that this limitation can be overcome with a simple representation-alignment (REPA) loss. Our method, REPA-E, enables stable and effective joint training of the VAE and the diffusion model, achieving state-of-the-art FID scores of 1.26 with classifier-free guidance and 1.83 without on ImageNet 256×256.
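At its core, REPA-E keeps the standard denoising objective for the diffusion model but stops its gradient at the VAE latents, while the REPA alignment loss backpropagates end-to-end into the VAE. Below is a minimal sketch of one training step: the module names and signatures (`vae`, `ldm`, `proj`, `dino`) are hypothetical stand-ins, the stop-gradient is realized with a second forward pass for clarity, and the VAE's own reconstruction/KL/GAN regularization terms are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def repa_e_step(vae, ldm, proj, dino, opt, images, lam=0.5):
    """One simplified REPA-E training step (hypothetical module interfaces).

    vae.encode(images) -> latents z (trainable VAE)
    ldm(z_t, t)        -> (velocity prediction, intermediate hidden features)
    proj               -> MLP mapping hidden features to the DINOv2 feature dim
    dino               -> frozen visual encoder providing alignment targets
    opt                -> optimizer over vae, ldm, and proj parameters
    """
    z = vae.encode(images)                      # gradient path into the VAE stays open
    t = torch.rand(z.size(0), device=z.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(z)
    z_t = (1 - t) * z + t * noise               # linear-interpolant noising

    # REPA branch: this forward keeps the gradient path into the VAE.
    _, hidden = ldm(z_t, t.flatten())

    # Diffusion branch: stop-gradient so the denoising loss cannot update the
    # VAE -- the key ingredient that makes joint training stable. (An actual
    # implementation can apply the stop-grad without a second forward pass.)
    pred, _ = ldm(z_t.detach(), t.flatten())
    loss_diff = F.mse_loss(pred, noise - z.detach())

    # Representation-alignment loss against frozen DINOv2 patch features;
    # assumes the denoiser and DINOv2 token counts match, as in REPA.
    with torch.no_grad():
        target = dino(images)                   # [B, N, D]
    loss_repa = -F.cosine_similarity(proj(hidden), target, dim=-1).mean()

    # Full method additionally keeps the VAE's regularization losses.
    loss = loss_diff + lam * loss_repa
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.detach()
```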
1. Accelerated Generation Performance: REPA-E speeds up diffusion training by over 17× and 45× relative to the REPA and vanilla training recipes, respectively, while achieving superior generation quality.
2. Improved VAE Latent-Space Structure: Joint tuning adaptively improves the latent-space structure of different VAE architectures, addressing their specific limitations without explicit regularization.
3. Superior Drop-in VAE Replacements: The resulting E2E-VAE serves as a drop-in replacement for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures; see the loading sketch after this list.
4. Effective From-Scratch Training: REPA-E enables joint training of both the VAE and the LDM from scratch, achieving superior performance compared to traditional training recipes.
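To illustrate the drop-in usage from item 3, the snippet below loads E2E-VAE through diffusers' `AutoencoderKL` interface. This assumes the released checkpoint is packaged in that format, and the hub id `REPA-E/e2e-vae` is a placeholder; substitute the actual released checkpoint path.

```python
import torch
from diffusers.models import AutoencoderKL

# "REPA-E/e2e-vae" is a placeholder hub id, not a confirmed checkpoint name.
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vae").eval().requires_grad_(False)

images = torch.randn(4, 3, 256, 256)  # dummy batch, values assumed in [-1, 1]
with torch.no_grad():
    # Same 4-channel, 8x-downsampled latent layout as SD-VAE, so existing LDM
    # pipelines can consume it unchanged (scale by vae.config.scaling_factor).
    latents = vae.encode(images).latent_dist.sample()
    recon = vae.decode(latents).sample
```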
@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  journal={arXiv preprint arXiv:2504.10483},
  year={2025},
}