We present REPA-E for T2I, a family of End-to-End Tuned VAEs for supercharging text-to-image generation training. End-to-end tuned VAEs show superior performance over their original counterparts across all benchmarks (COCO30k, DPG-Bench, GenAI-Bench, GenEval, MJHQ30k) without the need for any additional representation-alignment losses.
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to("cuda")
Full workflow for encoding and decoding images:
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to(device)
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
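The reconstruction comes back as a tensor in [-1, 1]; a small utility sketch (not part of the model API) for mapping it back to a viewable image:

# Map the reconstruction from [-1, 1] back to 8-bit pixels and save it.
recon = (reconstructed[0].clamp(-1, 1) + 1) * 127.5
Image.fromarray(recon.permute(1, 2, 0).to(torch.uint8).cpu().numpy()).save("reconstruction.png")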
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to("cuda")
Full workflow for encoding and decoding images:
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to(device)
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
pip install "diffusers>=0.35.0" "torch>=2.5.0"
Loading the VAE is as easy as:
from diffusers import AutoencoderKLQwenImage
vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to("cuda")
Full workflow for encoding and decoding images (note the frame dimension handling):
from io import BytesIO
import requests
from diffusers import AutoencoderKLQwenImage
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)
vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to(device)
# Add frame dimension (required for QwenImage VAE)
image_ = image.unsqueeze(2)
with torch.no_grad():
    latents = vae.encode(image_).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
# Remove frame dimension
latents = latents.squeeze(2)
reconstructed = reconstructed.squeeze(2)
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to("cuda")
Full workflow for encoding and decoding images (512×512 resolution):
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to(device)
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to("cuda")
Full workflow for encoding and decoding images (512×512 resolution):
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to(device)
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
pip install "diffusers>=0.33.0" "torch>=2.3.1"
Loading the VAE is as easy as:
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to("cuda")
Full workflow for encoding and decoding images (512×512 resolution):
from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to(device)
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
We perform end-to-end VAE tuning on ImageNet 256×256 and then use the tuned VAEs for text-to-image training.
End-to-End VAE Tuning on ImageNet 256×256. We follow the same training recipe as the original REPA-E paper, jointly tuning the VAE and a latent diffusion transformer with the representation-alignment loss; a simplified sketch of one training step is shown below.
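To make the recipe concrete, here is a minimal, illustrative sketch of a single end-to-end update, assuming the stop-gradient scheme described in the original REPA-E paper: the denoising loss updates only the diffusion transformer, the REPA alignment loss flows into both the transformer and the VAE, and a reconstruction term keeps the VAE a faithful autoencoder (the full recipe keeps the VAE's standard regularization losses as well). All modules, shapes, and weights below are hypothetical stand-ins, not our released training code.

import torch
import torch.nn.functional as F
from torch import nn

class TinyVAE(nn.Module):
    # Hypothetical stand-in for the real AutoencoderKL.
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 4, kernel_size=8, stride=8)
        self.dec = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
    def encode(self, x):
        return self.enc(x)
    def decode(self, z):
        return self.dec(z)

class TinyDiT(nn.Module):
    # Hypothetical stand-in denoiser (timestep conditioning omitted). Returns
    # the denoising prediction and an intermediate feature used for REPA alignment.
    def __init__(self, feat_dim=768):
        super().__init__()
        self.body = nn.Conv2d(4, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 4, 3, padding=1)
        self.proj = nn.Conv2d(64, feat_dim, 1)  # projector to the frozen-encoder dim
    def forward(self, z_t):
        h = F.silu(self.body(z_t))
        return self.head(h), self.proj(h)

def repa_e_step(vae, dit, vision_feats, x, optimizer, repa_weight=0.5):
    # vision_feats: features of x from a frozen vision encoder (DINOv2 in the
    # original recipe), shape (B, D, h, w); repa_weight is an example value.
    z = vae.encode(x)                                 # latents, grad reaches the VAE
    noise = torch.randn_like(z)
    t = torch.rand(z.size(0), 1, 1, 1)

    # (a) Denoising loss on a detached latent: updates the transformer only.
    #     (A second forward below keeps the stop-gradient explicit.)
    z_sg = z.detach()
    pred_sg, _ = dit((1 - t) * z_sg + t * noise)
    loss_denoise = F.mse_loss(pred_sg, noise - z_sg)  # flow-matching-style target (assumed)

    # (b) REPA alignment loss on the non-detached latent: updates BOTH the
    #     transformer and the VAE end to end.
    _, feat = dit((1 - t) * z + t * noise)
    feat = F.interpolate(feat, size=vision_feats.shape[-2:], mode="bilinear")
    loss_repa = 1 - F.cosine_similarity(feat, vision_feats, dim=1).mean()

    # (c) Keep reconstruction fidelity (the full recipe adds further VAE regularizers).
    loss_recon = F.mse_loss(vae.decode(z), x)

    loss = loss_denoise + repa_weight * loss_repa + loss_recon
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for real images and encoder features:
vae, dit = TinyVAE(), TinyDiT()
opt = torch.optim.AdamW(list(vae.parameters()) + list(dit.parameters()), lr=1e-4)
x = torch.randn(2, 3, 64, 64)
vision_feats = torch.randn(2, 768, 8, 8)
print(repa_e_step(vae, dit, vision_feats, x, opt))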
End-to-End Tuning Configuration for VAE:
T2I Training Setup. For our diffusion backbone, we follow the setup in Fuse-DiT.
T2I Training Configuration:
We next evaluate the performance of end-to-end tuned VAEs on various text-to-image generation benchmarks.
We evaluate End-to-End Tuned VAEs across multiple benchmarks and training scenarios, demonstrating consistent improvements over baseline VAEs. End-to-end tuned VAEs show faster convergence and better final performance across all metrics.
We compare T2I training using the original VAEs (e.g., FLUX-VAE) against training using their end-to-end tuned counterparts.
To understand the effectiveness of end-to-end tuning, we compare three approaches at 100K training steps: (1) the FLUX-VAE baseline without modifications, (2) FLUX-VAE with the REPA alignment loss applied during T2I training, and (3) our End-to-End Tuned FLUX-VAE.
We also analyze the generalization of end-to-end tuned VAEs to higher resolutions. We resume training from the 500K checkpoint trained at 256×256 resolution and train for an additional 200K steps at 512×512 resolution with batch size 448.
We observe that despite being tuned at 256×256 resolution on ImageNet, the end-to-end tuned VAE maintains its advantage over the original FLUX-VAE when training moves to 512×512, showing that the benefits generalize to higher resolutions.
For qualitative visualization, we use the 200K-step checkpoint trained at 512×512 resolution. All images are generated with 25 sampling steps and a guidance scale of 6.5. Beyond quantitative metrics, End-to-End Tuned VAEs produce visually superior results compared to baseline FLUX-VAE. The generated images show improved detail, better prompt adherence, and more coherent compositions. Below we show comparisons for T2I generations using models trained with FLUX-VAE and E2E-Tuned FLUX-VAE (Ours).
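If the trained T2I model is packaged as a diffusers pipeline, these sampling settings translate roughly to the sketch below; the checkpoint identifier and prompt are hypothetical placeholders, not released artifacts.

import torch
from diffusers import DiffusionPipeline

# "path/to/e2e-t2i-checkpoint" is a placeholder, not a released model id.
pipe = DiffusionPipeline.from_pretrained(
    "path/to/e2e-t2i-checkpoint", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
    "a photo of a corgi wearing sunglasses on the beach",  # example prompt
    num_inference_steps=25,   # 25 sampling steps, as in our visualizations
    guidance_scale=6.5,       # guidance scale used for the figures
).images[0]
image.save("sample.png")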
We also analyze the impact of end-to-end tuning on VAE reconstruction quality. Notably, despite being only tuned on ImageNet 256×256, end-to-end tuned VAEs show improved generation quality while maintaining reconstruction fidelity across challenging scenes with multiple faces, subjects and text.
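As a quick, hedged sanity check of reconstruction fidelity (not our benchmark protocol), PSNR between an input and its reconstruction can be computed directly from the tensors produced by the snippets above:

# Reuses `image` and `reconstructed` from the encode/decode examples; both live in [-1, 1].
mse = torch.mean((reconstructed.clamp(-1, 1) - image) ** 2)
psnr = 10 * torch.log10(4.0 / mse)  # peak-to-peak range is 2, so peak^2 = 4
print(f"Reconstruction PSNR: {psnr.item():.2f} dB")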
To understand what makes End-to-End Tuned VAEs effective, we analyze the learned latent representations through PCA projections and spatial similarity analysis.
These visualizations reveal how end-to-end tuning shapes the VAE's latent space to better support high-quality generation.
Notably, end-to-end tuning enriches the VAE latent space, incorporating more structural and semantic information than traditional VAEs (FLUX-VAE, SDVAE) trained for reconstruction alone. Please also refer to the original REPA-E paper for further analysis of the latent space.
We project VAE latent representations to 2D using PCA and visualize them as RGB images (first 3 principal components). This reveals the spatial structure and semantic organization learned by different VAE architectures. As illustrated in the PCA visualizations, end-to-end training injects additional structural and semantic information into the latent representations.
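A minimal sketch of this visualization, reusing the latents from the encode/decode examples above (the exact normalization used for our figures may differ):

import torch
from PIL import Image

z = latents[0]                          # (C, H, W) latent of a single image
C, H, W = z.shape
tokens = z.reshape(C, -1).T.float()     # (H*W, C): one token per spatial location
tokens = tokens - tokens.mean(dim=0, keepdim=True)
_, _, V = torch.pca_lowrank(tokens, q=3)
rgb = tokens @ V[:, :3]                 # project onto the first 3 principal components
rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-8)
Image.fromarray((rgb.reshape(H, W, 3) * 255).to(torch.uint8).cpu().numpy()).save("latent_pca.png")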
We also analyze the cosine similarity between patch tokens in the latent space to measure spatial structure. End-to-End Tuned VAEs show more coherent spatial patterns, indicating better capture of local and global image structure. As shown in the similarity maps, end-to-end tuning embeds more meaningful structural and semantic relationships between patches, making the latent space more informative for the diffusion model to generate high-quality images.
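A corresponding sketch for the similarity maps: cosine similarity between one reference latent location and every other location (the reference patch used in our figures may differ):

import torch
import torch.nn.functional as F

z = latents[0]                                      # (C, H, W), reused from the examples above
C, H, W = z.shape
tokens = F.normalize(z.reshape(C, -1).T, dim=-1)    # (H*W, C) unit-norm patch tokens
ref = tokens[(H // 2) * W + W // 2]                 # take the centre location as the reference
sim_map = (tokens @ ref).reshape(H, W)              # cosine similarity in [-1, 1]
print(sim_map.min().item(), sim_map.max().item())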
Finally, we show that the end-to-end tuned VAEs can also be used for traditional image generation benchmarks such as ImageNet 256×256.
In this work, we show how end-to-end tuned VAEs lead to better T2I training than standard VAEs such as FLUX-VAE and SDVAE, which are trained primarily for reconstruction alone.
This happens primarily because end-to-end tuned VAEs learn more semantic latent representations while maintaining strong reconstruction fidelity.
Going forward, we are actively studying the impact of end-to-end tuned VAEs on other downstream tasks that rely on both the semantic and reconstruction quality of VAE latents, such as image-to-image translation, image editing, and image inpainting.
We hope REPA-E for T2I will inspire further research into end-to-end training strategies for generative models and the co-design of
VAE/RAE architectures with diffusion transformers.
This work is a joint collaboration between the REPA-E team and Canva.
REPA-E Team: Xingjian Leng*, Jaskirat Singh*, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng
Canva Team: Ryan Murdock, Ethan Smith, Rebecca Li
@article{leng2025repae,
  title={{REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers}},
  author={Leng, Xingjian and Singh, Jaskirat and Hou, Yunzhong and Xing, Zhenchang and Xie, Saining and Zheng, Liang},
  journal={arXiv preprint arXiv:2504.10483},
  year={2025}
}