REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Xingjian Leng1*·Jaskirat Singh1*·Yunzhong Hou1
Zhenchang Xing2·Saining Xie3·Liang Zheng1
1 Australian National University   2Data61-CSIRO   3New York University  
*Project Leads

News


[Oct 2025] 🚨 Released REPA-E for T2I 🚨 — a family of End-to-End Tuned VAEs:
  • End-to-end training generalizes to T2I: E2E-VAEs achieve better T2I generation quality across multiple resolutions (256×256, 512×512) compared to their standard VAE counterparts, without requiring additional representation alignment losses
  • SOTA results on ImageNet 256×256: FID 1.12 with CFG and 1.69 without CFG
  • All models available as Hugging Face-compatible AutoencoderKL checkpoints — load directly with diffusers API, no custom wrapper needed
[Jun 2025] REPA-E accepted at ICCV 2025!
[Apr 2025] Paper, code, and pretrained models available on GitHub and Hugging Face.

Quickstart


E2E-FLUX-VAE (REPA-E/e2e-flux-vae)

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images:

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

# Download the example image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

# Convert HWC uint8 in [0, 255] to NCHW float32 in [-1, 1]
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).convert("RGB")
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to(device)

# Encode to latents, then decode back to pixel space
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

E2E-SD3.5-VAE (REPA-E/e2e-sd3.5-vae)

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images:

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

# Download the example image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

# Convert HWC uint8 in [0, 255] to NCHW float32 in [-1, 1]
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).convert("RGB")
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to(device)

# Encode to latents, then decode back to pixel space
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

E2E-QwenImage-VAE (REPA-E/e2e-qwenimage-vae)

Installation: pip install "diffusers>=0.35.0" "torch>=2.5.0"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKLQwenImage

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images (note the frame dimension handling):

from io import BytesIO
import requests
from diffusers import AutoencoderKLQwenImage
import numpy as np
import torch
from PIL import Image

# Download the example image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

# Convert HWC uint8 in [0, 255] to NCHW float32 in [-1, 1]
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).convert("RGB")
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to(device)

# Add frame dimension (required for QwenImage VAE): (B, C, H, W) -> (B, C, 1, H, W)
image_ = image.unsqueeze(2)

# Encode to latents, then decode back to pixel space
with torch.no_grad():
    latents = vae.encode(image_).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

# Remove frame dimension
latents = latents.squeeze(2)
reconstructed = reconstructed.squeeze(2)

E2E-SD-VAE (REPA-E/e2e-sdvae-hf)

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

# Download the example image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

# Resize to 512×512 and convert HWC uint8 in [0, 255] to NCHW float32 in [-1, 1]
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).convert("RGB").resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to(device)

# Encode to latents, then decode back to pixel space
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

E2E-VA-VAE (REPA-E/e2e-vavae-hf)

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

# Download the example image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

# Resize to 512×512 and convert HWC uint8 in [0, 255] to NCHW float32 in [-1, 1]
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).convert("RGB").resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to(device)

# Encode to latents, then decode back to pixel space
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

E2E-IN-VAE (REPA-E/e2e-invae-hf)

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

# Download the example image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

# Resize to 512×512 and convert HWC uint8 in [0, 255] to NCHW float32 in [-1, 1]
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).convert("RGB").resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to(device)

# Encode to latents, then decode back to pixel space
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

For complete usage examples and integration with diffusion models, see the individual model cards on Hugging Face.
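
As a rough illustration only (a sketch, not taken from the model cards): decoded outputs lie in [-1, 1], and latents are typically normalized before being handed to a latent diffusion model. The snippet below converts a decoded tensor back to an 8-bit image and applies the usual scalar latent normalization; it assumes the vae, latents, and reconstructed variables from the examples above, and the output file name is arbitrary.

# Sketch: post-process a decoded tensor and normalize latents for a diffusion model.
# Assumes `vae`, `latents`, and `reconstructed` from any of the AutoencoderKL examples above.
import torch
from PIL import Image

def to_pil(tensor: torch.Tensor) -> Image.Image:
    # Map NCHW float in [-1, 1] back to an HWC uint8 image
    img = ((tensor[0].clamp(-1, 1) + 1) * 127.5).round().byte()
    return Image.fromarray(img.permute(1, 2, 0).cpu().numpy())

to_pil(reconstructed).save("reconstruction.png")

# Latent normalization commonly used with diffusers VAEs: scaling_factor is always
# present in AutoencoderKL configs; shift_factor exists for FLUX/SD3-style VAEs.
shift = getattr(vae.config, "shift_factor", None) or 0.0
model_input = (latents - shift) * vae.config.scaling_factor

For the QwenImage VAE, check its config instead: Wan-style video VAEs normalize latents with per-channel latents_mean / latents_std rather than a single scaling_factor.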

Overview


We address a fundamental question: Can latent diffusion models and their VAE tokenizer be trained end-to-end? While training both components jointly with standard diffusion loss is observed to be ineffective — often degrading final performance — we show that this limitation can be overcome using a simple representation-alignment (REPA) loss. Our proposed method, REPA-E, enables stable and effective joint training of both the VAE and the diffusion model, achieving state-of-the-art FID scores of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256×256.
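
To make the recipe concrete, the sketch below outlines one possible training step. It is an illustration under stated assumptions, not the released implementation: the DiT feature interface (return_features), the frozen encoder dino_encoder, the projection head align_head, and the flow-matching noising are stand-ins. The key idea, following the paper, is that the diffusion loss is computed under a stop-gradient so it updates only the diffusion transformer, while the REPA alignment loss backpropagates through the latents into the VAE as well.

# Schematic REPA-E training step. Illustration only: the DiT feature interface,
# dino_encoder, align_head, and the noising schedule are assumptions for this sketch.
import torch
import torch.nn.functional as F

def repa_e_step(images, vae, dit, dino_encoder, align_head, lambda_repa=1.0):
    # Encode images without detaching, so alignment gradients can reach the VAE.
    latents = vae.encode(images).latent_dist.sample()

    # Flow-matching style noising (illustrative choice of schedule).
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)
    tb = t.view(-1, 1, 1, 1)
    noisy = (1.0 - tb) * latents + tb * noise

    # Diffusion loss under stop-gradient: this term updates only the diffusion transformer.
    pred = dit(noisy.detach(), t)
    loss_diff = F.mse_loss(pred, (noise - latents).detach())

    # REPA loss: align intermediate DiT features with a frozen visual encoder (e.g., DINOv2).
    # This branch is not detached, so its gradient also flows into the VAE.
    # (Shown as a second forward pass only to make the stop-gradient explicit.)
    _, features = dit(noisy, t, return_features=True)
    with torch.no_grad():
        targets = dino_encoder(images)
    loss_repa = 1.0 - F.cosine_similarity(align_head(features), targets, dim=-1).mean()

    # The standard VAE objectives (reconstruction, KL, perceptual/GAN) are kept as usual
    # and omitted here for brevity.
    return loss_diff + lambda_repa * loss_repa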

REPA-E Overview
Through extensive evaluations, we demonstrate that our end-to-end training approach REPA-E offers four key advantages:

1. Accelerated Generation Performance: REPA-E speeds up diffusion training by over 17× compared to REPA and over 45× compared to the vanilla training recipe, while achieving superior generation quality.

2. Improved VAE Latent-Space Structure: Joint tuning adaptively enhances latent space structure across different VAE architectures, addressing their specific limitations without explicit regularization.

3. Superior Drop-in VAE Replacements: The resulting E2E-VAE serves as a drop-in replacement for existing VAEs (e.g., SD-VAE), improving convergence and generation quality across diverse LDM architectures.

4. Effective From-Scratch Training: REPA-E enables joint training of both VAE and LDM from scratch, still achieving superior performance compared to traditional training approaches.

1. End-to-End Training Leads to Accelerated Generation Performance


Qualitative comparison between REPA and REPA-E at different training iterations

2. End-to-End Training Improves VAE Latent-Space Structure


PCA Analysis of Latent Space

3. End-to-End Tuned VAEs as Superior Drop-in Replacements


Drop-in VAE Performance Comparison

Note: Results for the 800-epoch checkpoints of REPA, LightningDiT, and E2E-VAE in the left table are evaluated with 50 images per class (class-balanced sampling).

4. Enables Effective From-Scratch Training


  • End-to-end training from scratch: REPA-E can jointly train both VAE and LDM from scratch in an end-to-end manner, without requiring VAE pre-training
  • Strong performance even without initialization: While initializing the VAE with pretrained weights slightly improves results, from-scratch training still achieves a gFID of 4.34 at 80 epochs, significantly outperforming REPA (7.90)
From-Scratch Training Performance

Citation


@article{leng2025repae,
  title={REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers},
  author={Xingjian Leng and Jaskirat Singh and Yunzhong Hou and Zhenchang Xing and Saining Xie and Liang Zheng},
  year={2025},
  journal={arXiv preprint arXiv:2504.10483},
}