REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
End-to-end training for better and more efficient latent diffusion models
End-to-end training is the keystone of modern deep learning. Latent diffusion training, however, remains two-stage: stage 1 trains the representation (VAE/RAE) and stage 2 trains the generator (DiT/SiT/JiT). Our mission is to unify the two, so that visual representation and generation can be trained and advanced together end to end.
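To make the two-stage vs. end-to-end distinction concrete, below is a minimal PyTorch-style sketch, not the released REPA-E code: `ToyVAE`, `ToyDiT`, the toy diffusion loss, and the mocked alignment term are all illustrative placeholders. It only shows where gradients flow in each setup; in the two-stage recipe the VAE is frozen while the diffusion transformer trains, whereas in the end-to-end update both are optimized jointly. The actual REPA-E recipe relies on a representation-alignment loss against features from a pretrained external encoder rather than the diffusion loss alone; here that term is only mocked with random features.

```python
# Illustrative sketch only (not the REPA-E repository API): contrasts a
# two-stage latent diffusion update (frozen VAE) with a joint end-to-end
# update of VAE + diffusion transformer. All modules and losses are toys.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVAE(nn.Module):
    """Tiny stand-in for the tokenizer/VAE (stage 1 in two-stage training)."""
    def __init__(self, in_dim=64, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, in_dim)

    def encode(self, x):
        return self.enc(x)

    def decode(self, z):
        return self.dec(z)


class ToyDiT(nn.Module):
    """Tiny stand-in for the latent diffusion transformer (stage 2)."""
    def __init__(self, latent_dim=16, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, latent_dim)
        )

    def forward(self, z_t, t):
        return self.net(torch.cat([z_t, t], dim=-1))


def diffusion_loss(dit, z):
    """Simple noise-prediction objective on latents (illustrative only)."""
    t = torch.rand(z.size(0), 1)
    noise = torch.randn_like(z)
    z_t = (1 - t) * z + t * noise  # toy forward process on the latents
    return F.mse_loss(dit(z_t, t), noise)


x = torch.randn(8, 64)  # dummy image features
vae, dit = ToyVAE(), ToyDiT()

# --- Two-stage: the VAE is frozen; only the diffusion model receives gradients. ---
opt_stage2 = torch.optim.AdamW(dit.parameters(), lr=1e-4)
with torch.no_grad():
    z = vae.encode(x)  # latents carry no gradient back to the VAE
loss = diffusion_loss(dit, z)
opt_stage2.zero_grad()
loss.backward()
opt_stage2.step()

# --- End-to-end: VAE and diffusion transformer are updated together. ---
# REPA-E pairs this with a representation-alignment term against features from
# an external pretrained encoder; random features stand in for them here.
opt_e2e = torch.optim.AdamW(list(vae.parameters()) + list(dit.parameters()), lr=1e-4)
external_feats = torch.randn(8, 16)  # placeholder for external encoder features
z = vae.encode(x)  # gradients now flow back into the VAE
align = 1.0 - F.cosine_similarity(z, external_feats, dim=-1).mean()
recon = F.mse_loss(vae.decode(z), x)
loss = diffusion_loss(dit, z) + recon + align
opt_e2e.zero_grad()
loss.backward()
opt_e2e.step()
```

The only point of the sketch is the optimizer and gradient wiring: which parameters sit in the optimizer and whether the latents are detached from the VAE determines whether representation and generation improve together.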
Family of end-to-end tuned VAEs showing superior T2I performance across all benchmarks (COCO30k, DPG-Bench, GenAI-Bench, GenEval, MJHQ30k)
Large-scale empirical analysis revealing that spatial structure, not global information, drives generation performance of external representations
Training and evaluation code for REPA-E models
Family of end-to-end models for supercharging T2I training
If you are interested in our research on end-to-end diffusion models and visual representation learning, we would love to hear from you. Whether you want to collaborate, discuss our work, or explore new research directions, feel free to reach out.
Contact Us