We address a fundamental question: can latent diffusion models and their VAE tokenizer be trained end-to-end? Naively training both components jointly with the standard diffusion loss is ineffective, often degrading final performance; we show that this limitation can be overcome with a simple representation-alignment (REPA) loss. Our method, REPA-E, enables stable and effective joint training of the VAE and the diffusion model, achieving state-of-the-art FID scores of 1.26 with classifier-free guidance and 1.83 without on ImageNet 256×256.
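At its core, REPA-E keeps the standard denoising objective for the diffusion model but stops its gradient at the VAE latents, while the REPA alignment loss backpropagates end-to-end into the VAE. Below is a minimal sketch of one training step: the module names and signatures (`vae`, `ldm`, `proj`, `dino`) are hypothetical stand-ins, the stop-gradient is realized with a second forward pass for clarity, and the VAE's own reconstruction/KL/GAN regularization terms are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def repa_e_step(vae, ldm, proj, dino, opt, images, lam=0.5):
    """One simplified REPA-E training step (hypothetical module interfaces).

    vae.encode(images) -> latents z (trainable VAE)
    ldm(z_t, t)        -> (velocity prediction, intermediate hidden features)
    proj               -> MLP mapping hidden features to the DINOv2 feature dim
    dino               -> frozen visual encoder providing alignment targets
    opt                -> optimizer over vae, ldm, and proj parameters
    """
    z = vae.encode(images)                      # gradient path into the VAE stays open
    t = torch.rand(z.size(0), device=z.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(z)
    z_t = (1 - t) * z + t * noise               # linear-interpolant noising

    # REPA branch: this forward keeps the gradient path into the VAE.
    _, hidden = ldm(z_t, t.flatten())

    # Diffusion branch: stop-gradient so the denoising loss cannot update the
    # VAE -- the key ingredient that makes joint training stable. (An actual
    # implementation can apply the stop-grad without a second forward pass.)
    pred, _ = ldm(z_t.detach(), t.flatten())
    loss_diff = F.mse_loss(pred, noise - z.detach())

    # Representation-alignment loss against frozen DINOv2 patch features;
    # assumes the denoiser and DINOv2 token counts match, as in REPA.
    with torch.no_grad():
        target = dino(images)                   # [B, N, D]
    loss_repa = -F.cosine_similarity(proj(hidden), target, dim=-1).mean()

    # Full method additionally keeps the VAE's regularization losses.
    loss = loss_diff + lam * loss_repa
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.detach()
```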
1. Accelerated Generation Performance: REPA-E speeds up diffusion training by over 17× and 45× relative to the REPA and vanilla training recipes, respectively, while achieving superior generation quality.
2. Improved VAE Latent-Space Structure: Joint tuning adaptively improves the latent-space structure of different VAE architectures, addressing their specific limitations without explicit regularization.
3. Superior Drop-in VAE Replacements: The resulting E2E-VAE serves as a drop-in replacement for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures; see the loading sketch after this list.
4. Effective From-Scratch Training: REPA-E enables joint training of both the VAE and the LDM from scratch, achieving superior performance compared to traditional training recipes.
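To illustrate the drop-in usage from item 3, the snippet below loads E2E-VAE through diffusers' `AutoencoderKL` interface. This assumes the released checkpoint is packaged in that format, and the hub id `REPA-E/e2e-vae` is a placeholder; substitute the actual released checkpoint path.

```python
import torch
from diffusers.models import AutoencoderKL

# "REPA-E/e2e-vae" is a placeholder hub id, not a confirmed checkpoint name.
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vae").eval().requires_grad_(False)

images = torch.randn(4, 3, 256, 256)  # dummy batch, values assumed in [-1, 1]
with torch.no_grad():
    # Same 4-channel, 8x-downsampled latent layout as SD-VAE, so existing LDM
    # pipelines can consume it unchanged (scale by vae.config.scaling_factor).
    latents = vae.encode(images).latent_dist.sample()
    recon = vae.decode(latents).sample
```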
@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  journal={arXiv preprint arXiv:2504.10483},
  year={2025},
}