REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

We show that latent diffusion models and their VAE tokenizer can be effectively trained end-to-end using a simple representation-alignment (REPA) loss. REPA-E achieves state-of-the-art FID scores of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256×256.

  • 17× Faster Training: REPA-E trains 17× faster than REPA and 45× faster than vanilla training.
  • SOTA Generation Quality: Achieves FID 1.12 (w/ CFG) and 1.69 (w/o CFG) on ImageNet 256×256.
  • Drop-in VAE Replacements: E2E-VAE serves as a superior drop-in replacement across diverse diffusion architectures.
REPA-E Results Visualization

Key Findings

  1. End-to-end training dramatically accelerates convergence. REPA-E achieves a 17× speedup over REPA and 45× over vanilla training while delivering superior generation quality.

  2. Joint training adaptively improves VAE latent structure. The VAE learns to produce latents better suited to the diffusion model's denoising task.

  3. E2E-VAE serves as a superior drop-in replacement. It achieves SOTA FID of 1.12 (w/ CFG) and 1.69 (w/o CFG) on ImageNet 256×256.

News

[Oct 2025] Released REPA-E for T2I, a family of end-to-end tuned VAEs:
  • SOTA results on ImageNet 256×256: FID 1.12 with CFG and 1.69 without CFG
  • All models available as Hugging Face-compatible AutoencoderKL checkpoints
[Jun 2025] REPA-E accepted at ICCV 2025!
[Apr 2025] Paper, code, and pretrained models available on GitHub and Hugging Face.

Quickstart

REPA-E/e2e-flux-vae

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images:

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

# Download the example image; it is converted below to a (1, 3, H, W) float tensor in [-1, 1].
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

REPA-E/e2e-sd3.5-vae

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images:

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

REPA-E/e2e-qwenimage-vae

Installation: pip install "diffusers>=0.35.0" "torch>=2.5.0"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKLQwenImage

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images (note the frame dimension handling):

from io import BytesIO
import requests
from diffusers import AutoencoderKLQwenImage
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to(device)

# Add frame dimension (required for QwenImage VAE)
image_ = image.unsqueeze(2)

with torch.no_grad():
    latents = vae.encode(image_).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

# Remove frame dimension
latents = latents.squeeze(2)
reconstructed = reconstructed.squeeze(2)

REPA-E/e2e-sdvae-hf

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

REPA-E/e2e-vavae-hf

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

REPA-E/e2e-invae-hf

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
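
In every example above, reconstructed is a float tensor in roughly [-1, 1]. A minimal sketch for converting it back to a viewable image, reusing the variables and imports from the examples (the output filename is arbitrary):

# Map the reconstruction from [-1, 1] back to [0, 255] and save it.
reconstructed = (reconstructed / 2 + 0.5).clamp(0, 1)
array = (reconstructed[0].permute(1, 2, 0).cpu().numpy() * 255).round().astype("uint8")
Image.fromarray(array).save("reconstruction.png")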

For complete usage examples and integration with diffusion models, see the individual model cards on Hugging Face.


Overview

We address a fundamental question: Can latent diffusion models and their VAE tokenizer be trained end-to-end? While training both components jointly with standard diffusion loss is observed to be ineffective — often degrading final performance — we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, REPA-E, enables stable and effective joint training of both the VAE and the diffusion model, achieving state-of-the-art FID scores of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256×256.

REPA-E Overview
REPA-E Overview. End-to-end training of VAE and diffusion model using representation alignment loss.
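
To make the training objective more concrete, the sketch below illustrates one way the joint update can be organized: the standard diffusion loss is computed on stop-gradient latents so that it only updates the diffusion model, while the REPA alignment loss (cosine similarity between projected diffusion-transformer features and features from a frozen pretrained encoder such as DINOv2) is backpropagated through both the diffusion model and the VAE. This is an illustrative simplification, not the official implementation: the module interfaces (dit.features, proj, dino), the add_noise schedule, the loss weight, and the omitted VAE regularization terms are all placeholders.

import torch
import torch.nn.functional as F

def add_noise(z, noise, t):
    # Stand-in linear interpolation schedule; the real recipe follows the paper.
    t = t.view(-1, 1, 1, 1)
    return (1 - t) * z + t * noise

def repa_e_losses(x, t, noise, vae, dit, proj, dino, lambda_repa=0.5):
    """Simplified sketch of one REPA-E training step (not the official code)."""
    # VAE latents; gradients can flow back into the VAE through this path.
    z = vae.encode(x).latent_dist.sample()

    # Diffusion loss on detached latents: trains the diffusion model only.
    pred = dit(add_noise(z.detach(), noise, t), t)
    loss_diff = F.mse_loss(pred, noise)

    # REPA alignment loss on non-detached latents: aligns intermediate
    # diffusion features with frozen pretrained-encoder features of the clean
    # image, updating the diffusion model *and* the VAE end-to-end.
    feats = dit.features(add_noise(z, noise, t), t)
    with torch.no_grad():
        target = dino(x)
    loss_repa = -F.cosine_similarity(proj(feats), target, dim=-1).mean()

    # The usual VAE regularization losses (reconstruction, KL, perceptual/GAN)
    # are retained in the full method; omitted here for brevity.
    return loss_diff + lambda_repa * loss_repa

Running the transformer separately for the two losses is only meant to make the gradient routing explicit; consult the official implementation for the efficient formulation and the exact loss weights.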

Through extensive evaluations, we demonstrate that our end-to-end training approach REPA-E offers four key advantages:

  1. E2E Leads to Faster Training: REPA-E speeds up diffusion training by over 17× and 45× compared to the REPA and vanilla training recipes, respectively.
  2. E2E Leads to Improved Latent Space: Joint tuning adaptively enhances latent space structure across different VAE architectures.
  3. E2E VAEs are Better than Regular VAEs: The resulting E2E-VAE serves as a drop-in replacement, improving convergence and generation quality across diverse LDM architectures.
  4. E2E Enables From-Scratch Training: REPA-E also enables joint training of both the VAE and LDM from scratch, without separate VAE pre-training.
  • Core Insight: End-to-end training of the VAE and diffusion model becomes effective with the REPA loss, overcoming previous instability issues and achieving SOTA generation quality.

1. E2E Leads to Faster Training

REPA-E dramatically accelerates diffusion model training while achieving superior generation quality. We demonstrate consistent improvements across different model scales and VAE architectures.

Comparison of methods with and without end-to-end tuning
Method Tokenizer Epochs gFID↓ sFID↓ IS↑
Without End-to-End Tuning
MaskDiT [54] SD-VAE 1600 5.69 10.34 177.9
DiT [34] SD-VAE 1400 9.62 6.85 121.5
SiT [30] SD-VAE 1400 8.61 6.32 131.7
FasterDiT [49] SD-VAE 400 7.91 5.45 131.3
REPA [52] SD-VAE 20 19.40 6.06 67.4
40 11.10 5.05 100.4
80 7.90 5.06 122.6
800 5.90 5.73 157.8
With End-to-End Tuning (Ours)
REPA-E SD-VAE* 20 12.83 5.04 88.8
40 7.17 4.39 123.7
80 4.07 4.60 161.8
Scalability across diffusion model sizes
Diff. Model gFID↓ sFID↓ IS↑ Prec.↑ Rec.↑
SiT-B (130M) 49.5 7.00 27.5 0.46 0.59
+REPA-E (Ours) 34.8 6.31 39.1 0.57 0.59
SiT-L (458M) 24.1 6.25 55.7 0.62 0.60
+REPA-E (Ours) 16.3 5.69 75.0 0.68 0.60
SiT-XL (675M) 19.4 6.06 67.4 0.64 0.61
+REPA-E (Ours) 12.8 5.04 88.8 0.71 0.58
Generalization across different VAE architectures
Autoencoder gFID↓ sFID↓ IS↑ Prec.↑ Rec.↑
SD-VAE [39] 24.1 6.25 55.7 0.62 0.60
+REPA-E (Ours) 16.3 5.69 75.0 0.68 0.60
IN-VAE (f16d32) 22.7 5.47 56.0 0.62 0.62
+REPA-E (Ours) 12.7 5.57 84.0 0.69 0.62
VA-VAE [48] 12.8 6.47 83.8 0.71 0.58
+REPA-E (Ours) 11.1 5.31 88.8 0.72 0.61
Visual comparison at different iterations
Qualitative comparison between REPA and REPA-E. Images generated at different training iterations using identical noise and labels.
  • Finding 1: REPA-E achieves 17× speedup over REPA and 45× over vanilla training while delivering superior generation quality across all tested configurations.

2. E2E Leads to Improved Latent Space

End-to-end training with REPA-E adaptively refines the VAE's latent space structure without explicit regularization. Different VAE architectures exhibit different failure modes, and REPA-E addresses each appropriately.

PCA Analysis of Latent Space
PCA visualization of VAE latent spaces. End-to-end tuning with REPA-E improves latent representation quality across different VAE architectures.
  • Finding 2: End-to-end training adaptively improves latent space structure—reducing noise in SD-VAE while adding structural detail to IN-VAE/VA-VAE—without explicit regularization.
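
For readers who want to reproduce this kind of inspection on their own latents, the snippet below shows one simple diagnostic: project the latent channels onto their top three principal components and render them as an RGB image. It is a rough sketch of the general technique, not the exact procedure used to produce the figure; the resize resolution and output path are arbitrary.

import numpy as np
from PIL import Image

def pca_rgb(latents, out_path="latent_pca.png"):
    """Visualize a (1, C, H, W) latent tensor via its top-3 principal components."""
    z = latents[0].detach().cpu().numpy()           # (C, H, W)
    c, h, w = z.shape
    flat = z.reshape(c, -1).T                       # (H*W, C): one C-dim vector per spatial site
    flat = flat - flat.mean(axis=0, keepdims=True)  # center before PCA
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T                          # project onto top-3 components -> (H*W, 3)
    lo = proj.min(axis=0)
    proj = (proj - lo) / (proj.max(axis=0) - lo + 1e-8)  # scale each channel to [0, 1]
    rgb = (proj.reshape(h, w, 3) * 255).astype("uint8")
    Image.fromarray(rgb).resize((256, 256), Image.NEAREST).save(out_path)

For example, pca_rgb(latents) can be called directly on the latents produced in the Quickstart examples above.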

3. E2E VAEs are Better than Regular VAEs

The end-to-end tuned E2E-VAE serves as a universal drop-in replacement for standard VAEs, delivering consistent improvements across diverse diffusion model architectures without requiring any modifications to the training pipeline.

Drop-in VAE Performance Comparison
E2E-VAE as drop-in replacement. Comparison showing E2E-VAE delivers consistent improvements across different diffusion architectures.
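
As a rough illustration of what "drop-in" means in practice, the sketch below swaps an E2E-VAE into a generic latent-diffusion training setup: images are encoded with the frozen VAE, latents are scaled by the VAE's configured scaling factor, and the diffusion model is trained on them exactly as before. The checkpoint choice, the scaling convention, and the diffusion_model / dataloader names are illustrative placeholders; the official training code should be treated as the reference recipe.

import torch
from diffusers import AutoencoderKL

device = "cuda"

# Load the end-to-end tuned VAE and freeze it; it stands in for the original
# tokenizer without any other change to the surrounding training pipeline.
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to(device).eval()
vae.requires_grad_(False)

scale = vae.config.scaling_factor  # latent scaling expected by the diffusion model

def encode_batch(images):
    """images: (B, 3, H, W) in [-1, 1] -> scaled latents for diffusion training."""
    with torch.no_grad():
        return vae.encode(images.to(device)).latent_dist.sample() * scale

# Inside an existing training loop (diffusion_model, optimizer, and dataloader
# are placeholders for whatever LDM setup is already in use):
# for images, labels in dataloader:
#     latents = encode_batch(images)
#     loss = diffusion_model.training_loss(latents, labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()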

From-Scratch Training

REPA-E enables effective joint training of both VAE and LDM from scratch, eliminating the need for separate VAE pre-training while still achieving superior performance compared to traditional approaches.

From-scratch training results. REPA-E enables effective joint training without VAE pre-training.
Method gFID↓ sFID↓ IS↑ Prec.↑ Rec.↑
100K Iterations (20 Epochs)
REPA [52] 19.40 6.06 67.4 0.64 0.61
REPA-E (scratch) 14.12 7.87 83.5 0.70 0.59
REPA-E (VAE init.) 12.83 5.04 88.8 0.71 0.58
200K Iterations (40 Epochs)
REPA [52] 11.10 5.05 100.4 0.69 0.64
REPA-E (scratch) 7.54 6.17 120.4 0.74 0.61
REPA-E (VAE init.) 7.17 4.39 123.7 0.74 0.62
400K Iterations (80 Epochs)
REPA [52] 7.90 5.06 122.6 0.70 0.65
REPA-E (scratch) 4.34 4.44 154.3 0.75 0.63
REPA-E (VAE init.) 4.07 4.60 161.8 0.76 0.62
  • Finding 3: E2E-VAE serves as a universal drop-in replacement, achieving SOTA FID of 1.12 (w/ CFG) and 1.69 (w/o CFG) across diverse diffusion architectures.

Conclusion

We introduced REPA-E, a method for end-to-end training of latent diffusion models and their VAE tokenizers. Our key findings demonstrate that:

  1. End-to-end training with REPA loss achieves 17× speedup over REPA and 45× speedup over vanilla training
  2. Joint training adaptively improves VAE latent space structure across different architectures
  3. E2E-VAE serves as a superior drop-in replacement, achieving SOTA FID of 1.12 on ImageNet 256×256

These results establish end-to-end training as a practical and effective approach for training latent diffusion models, opening new possibilities for joint architecture optimization and task-specific adaptations.

BibTeX

@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  year={2025},
  journal={arXiv preprint arXiv:2504.10483},
}

@misc{repaet2i2025,
  title={Family of End-to-End Tuned VAEs for Supercharging T2I Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Ryan Murdock and Ethan Smith and Rebecca Li and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  howpublished={\url{https://end2end-diffusion.github.io/repa-e-t2i/}},
  year={2025},
}

References

[1] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. CVPR.

[2] Peebles, W., & Xie, S. (2023). Scalable diffusion models with transformers. ICCV.

[3] Yu, S., Sohn, K., Kim, S., & Shin, J. (2024). Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940.

[4] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. CVPR.

[5] Esser, P., Kulal, S., Blattmann, A., et al. (2024). Scaling rectified flow transformers for high-resolution image synthesis. ICML.