We show that latent diffusion models and their VAE tokenizer can be effectively trained end-to-end using a simple representation-alignment (REPA) loss. REPA-E achieves state-of-the-art FID scores of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256×256.
End-to-end training dramatically accelerates convergence. REPA-E trains 17× faster than REPA and 45× faster than vanilla training while delivering superior generation quality.
Joint training adaptively improves VAE latent structure. The VAE learns to produce latents better suited for the diffusion model's denoising task.
E2E-VAE serves as a superior drop-in replacement, achieving SOTA FID of 1.12 (w/ CFG) and 1.69 (w/o CFG) on ImageNet 256×256.
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to("cuda")
Full workflow for encoding and decoding images:
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
# Load the example image and normalize pixel values to [-1, 1]
image = torch.from_numpy(
    np.array(Image.open(BytesIO(response.content)))
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to(device)

# Encode to latents, then decode back to an image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
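Continuing from the snippet above, the decoded output is a tensor in [-1, 1]; the short post-processing sketch below (equally applicable to the other variants on this page) converts it back to a PIL image. The output filename is an arbitrary choice.

# Map the reconstruction from [-1, 1] back to [0, 255] and save it as a PNG
recon = (reconstructed[0].permute(1, 2, 0).clamp(-1, 1) + 1) * 127.5
Image.fromarray(recon.to(torch.uint8).cpu().numpy()).save("reconstruction.png")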
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to("cuda")
Full workflow for encoding and decoding images:
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
# Load the example image and normalize pixel values to [-1, 1]
image = torch.from_numpy(
    np.array(Image.open(BytesIO(response.content)))
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to(device)

# Encode to latents, then decode back to an image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
pip install "diffusers>=0.35.0" "torch>=2.5.0"
Loading the VAE is as easy as:
from diffusers import AutoencoderKLQwenImage
vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to("cuda")
Full workflow for encoding and decoding images (note the frame dimension handling):
from io import BytesIO
import requests
from diffusers import AutoencoderKLQwenImage
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
# Load the example image and normalize pixel values to [-1, 1]
image = torch.from_numpy(
    np.array(Image.open(BytesIO(response.content)))
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to(device)

# Add a frame dimension (required for the QwenImage VAE, which expects B, C, T, H, W inputs)
image_ = image.unsqueeze(2)

# Encode to latents, then decode back to an image
with torch.no_grad():
    latents = vae.encode(image_).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

# Remove the frame dimension again
latents = latents.squeeze(2)
reconstructed = reconstructed.squeeze(2)
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to("cuda")
Full workflow for encoding and decoding images (512×512 resolution):
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
# Load the example image, resize to 512×512, and normalize pixel values to [-1, 1]
image = torch.from_numpy(
    np.array(Image.open(BytesIO(response.content)).resize((512, 512)))
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to(device)

# Encode to latents, then decode back to an image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to("cuda")
Full workflow for encoding and decoding images (512×512 resolution):
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
# Load the example image, resize to 512×512, and normalize pixel values to [-1, 1]
image = torch.from_numpy(
    np.array(Image.open(BytesIO(response.content)).resize((512, 512)))
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to(device)

# Encode to latents, then decode back to an image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to("cuda")
Full workflow for encoding and decoding images (512×512 resolution):
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
# Load the example image, resize to 512×512, and normalize pixel values to [-1, 1]
image = torch.from_numpy(
    np.array(Image.open(BytesIO(response.content)).resize((512, 512)))
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to(device)

# Encode to latents, then decode back to an image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
For complete usage examples and integration with diffusion models, see the individual model cards on Hugging Face.
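As a rough sketch of what such an integration can look like (not taken from the model cards): diffusers pipelines accept component overrides at load time, so the end-to-end tuned VAE can be passed in place of the default one. The base checkpoint name and sampling settings below are illustrative assumptions, and compatibility for each VAE is documented on its model card.

import torch
from diffusers import AutoencoderKL, FluxPipeline

# Load the end-to-end tuned VAE and pass it as a component override
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae", torch_dtype=torch.bfloat16)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed base checkpoint, for illustration only
    vae=vae,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a photo of a red panda", num_inference_steps=28).images[0]
image.save("sample.png")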
We address a fundamental question: can latent diffusion models and their VAE tokenizer be trained end-to-end? While jointly training both components with the standard diffusion loss is observed to be ineffective, often degrading final performance, we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, REPA-E, enables stable and effective joint training of both the VAE and the diffusion model, achieving state-of-the-art FID scores of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256×256.
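For intuition only, the sketch below shows one common way a representation-alignment loss is written: intermediate diffusion-transformer features are projected and pushed, via cosine similarity, toward patch features from a frozen pretrained visual encoder. The function name, tensor shapes, and loss weighting are illustrative assumptions rather than the exact REPA-E training code.

import torch
import torch.nn.functional as F

def repa_alignment_loss(diffusion_feats, encoder_feats, projector):
    """Negative cosine similarity between projected diffusion-transformer features
    and features from a frozen pretrained visual encoder (illustrative sketch).

    diffusion_feats: [B, N, C_dit] hidden states from an intermediate transformer block
    encoder_feats:   [B, N, C_enc] patch features from the frozen encoder
    projector:       small MLP mapping C_dit -> C_enc
    """
    projected = projector(diffusion_feats)                       # [B, N, C_enc]
    sim = F.cosine_similarity(projected, encoder_feats, dim=-1)  # [B, N]
    return -sim.mean()

# In end-to-end training, the total objective would combine the usual diffusion loss,
# the VAE losses, and this alignment term, e.g.
#   loss = diffusion_loss + vae_loss + lambda_repa * alignment_loss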
Through extensive evaluations, we demonstrate that our end-to-end training approach REPA-E offers four key advantages:
REPA-E dramatically accelerates diffusion model training while achieving superior generation quality. We demonstrate consistent improvements across different model scales and VAE architectures.
| Method | Tokenizer | Epochs | gFID↓ | sFID↓ | IS↑ |
|---|---|---|---|---|---|
| Without End-to-End Tuning | |||||
| MaskDiT [54] | SD-VAE | 1600 | 5.69 | 10.34 | 177.9 |
| DiT [34] | SD-VAE | 1400 | 9.62 | 6.85 | 121.5 |
| SiT [30] | SD-VAE | 1400 | 8.61 | 6.32 | 131.7 |
| FasterDiT [49] | SD-VAE | 400 | 7.91 | 5.45 | 131.3 |
| REPA [52] | SD-VAE | 20 | 19.40 | 6.06 | 67.4 |
| REPA [52] | SD-VAE | 40 | 11.10 | 5.05 | 100.4 |
| REPA [52] | SD-VAE | 80 | 7.90 | 5.06 | 122.6 |
| REPA [52] | SD-VAE | 800 | 5.90 | 5.73 | 157.8 |
| With End-to-End Tuning (Ours) | |||||
| REPA-E | SD-VAE* | 20 | 12.83 | 5.04 | 88.8 |
| REPA-E | SD-VAE* | 40 | 7.17 | 4.39 | 123.7 |
| REPA-E | SD-VAE* | 80 | 4.07 | 4.60 | 161.8 |
| Diff. Model | gFID↓ | sFID↓ | IS↑ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|---|
| SiT-B (130M) | 49.5 | 7.00 | 27.5 | 0.46 | 0.59 |
| +REPA-E (Ours) | 34.8 | 6.31 | 39.1 | 0.57 | 0.59 |
| SiT-L (458M) | 24.1 | 6.25 | 55.7 | 0.62 | 0.60 |
| +REPA-E (Ours) | 16.3 | 5.69 | 75.0 | 0.68 | 0.60 |
| SiT-XL (675M) | 19.4 | 6.06 | 67.4 | 0.64 | 0.61 |
| +REPA-E (Ours) | 12.8 | 5.04 | 88.8 | 0.71 | 0.58 |
| Autoencoder | gFID↓ | sFID↓ | IS↑ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|---|
| SD-VAE [39] | 24.1 | 6.25 | 55.7 | 0.62 | 0.60 |
| +REPA-E (Ours) | 16.3 | 5.69 | 75.0 | 0.68 | 0.60 |
| IN-VAE (f16d32) | 22.7 | 5.47 | 56.0 | 0.62 | 0.62 |
| +REPA-E (Ours) | 12.7 | 5.57 | 84.0 | 0.69 | 0.62 |
| VA-VAE [48] | 12.8 | 6.47 | 83.8 | 0.71 | 0.58 |
| +REPA-E (Ours) | 11.1 | 5.31 | 88.8 | 0.72 | 0.61 |
End-to-end training with REPA-E adaptively refines the VAE's latent space structure without explicit regularization. Different VAE architectures exhibit different failure modes, and REPA-E addresses each appropriately.
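As one illustrative (and entirely optional) way to see this effect, simple per-channel statistics of the sampled latents can be compared between an off-the-shelf VAE and its end-to-end tuned counterpart; the helper below is a hypothetical diagnostic, not part of the REPA-E training code.

import torch

@torch.no_grad()
def latent_channel_stats(vae, images):
    """Per-channel mean and std of sampled latents for a batch of images in [-1, 1]."""
    latents = vae.encode(images).latent_dist.sample()  # [B, C, h, w]
    flat = latents.flatten(2)                          # [B, C, h*w]
    return flat.mean(dim=(0, 2)), flat.std(dim=(0, 2))

# Comparing these statistics (or a PCA of the flattened latents) before and after
# end-to-end tuning gives a rough picture of how joint training reshapes the latent space.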
The end-to-end tuned E2E-VAE serves as a universal drop-in replacement for standard VAEs, delivering consistent improvements across diverse diffusion model architectures without requiring any modifications to the training pipeline.
REPA-E enables effective joint training of both VAE and LDM from scratch, eliminating the need for separate VAE pre-training while still achieving superior performance compared to traditional approaches.
| Method | gFID↓ | sFID↓ | IS↑ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|---|
| 100K Iterations (20 Epochs) | |||||
| REPA [52] | 19.40 | 6.06 | 67.4 | 0.64 | 0.61 |
| REPA-E (scratch) | 14.12 | 7.87 | 83.5 | 0.70 | 0.59 |
| REPA-E (VAE init.) | 12.83 | 5.04 | 88.8 | 0.71 | 0.58 |
| 200K Iterations (40 Epochs) | |||||
| REPA [52] | 11.10 | 5.05 | 100.4 | 0.69 | 0.64 |
| REPA-E (scratch) | 7.54 | 6.17 | 120.4 | 0.74 | 0.61 |
| REPA-E (VAE init.) | 7.17 | 4.39 | 123.7 | 0.74 | 0.62 |
| 400K Iterations (80 Epochs) | |||||
| REPA [52] | 7.90 | 5.06 | 122.6 | 0.70 | 0.65 |
| REPA-E (scratch) | 4.34 | 4.44 | 154.3 | 0.75 | 0.63 |
| REPA-E (VAE init.) | 4.07 | 4.60 | 161.8 | 0.76 | 0.62 |
We introduced REPA-E, a method for end-to-end training of latent diffusion models and their VAE tokenizers. Our key findings are that a simple representation-alignment loss makes joint VAE and diffusion training stable and effective, that end-to-end training dramatically accelerates convergence and adaptively improves the VAE's latent structure, and that the resulting E2E-VAE serves as a superior drop-in replacement for existing VAEs.
These results establish end-to-end training as a practical and effective approach for training latent diffusion models, opening new possibilities for joint architecture optimization and task-specific adaptations.
@article{leng2025repae,
title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
year={2025},
journal={arXiv preprint arXiv:2504.10483},
}
@misc{repaet2i2025,
title={Family of End-to-End Tuned VAEs for Supercharging T2I Diffusion Transformers},
author={Xingjian Leng and Jaskirat Singh and Ryan Murdock and Ethan Smith and Rebecca Li and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
howpublished={\url{https://end2end-diffusion.github.io/repa-e-t2i/}},
year={2025},
}
[1] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. CVPR.
[2] Peebles, W., & Xie, S. (2023). Scalable diffusion models with transformers. ICCV.
[3] Yu, S., Sohn, K., Kim, S., & Shin, J. (2024). Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940.
[4] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. CVPR.
[5] Esser, P., Kulal, S., Blattmann, A., et al. (2024). Scaling rectified flow transformers for high-resolution image synthesis. ICML.