REPA-E for T2I

A family of end-to-end tuned VAEs for supercharging T2I diffusion transformers

We present REPA-E for T2I, a family of End-to-End Tuned VAEs for supercharging text-to-image generation training. End-to-end tuned VAEs show superior performance over their original counterparts across all benchmarks (COCO30k, DPG-Bench, GenAI-Bench, GenEval, MJHQ30k) without the need for any additional representation alignment losses.

  • Better T2I Performance: Consistent improvements over baseline VAEs across all benchmarks (COCO30k, DPG-Bench, GenAI-Bench, GenEval, MJHQ30k) with end-to-end tuned VAEs.
  • Family of End-to-End Tuned VAEs: We release end-to-end tuned VAEs across different VAE families (FLUX-VAE, SD3.5-VAE, Qwen-Image-VAE) for supercharging text-to-image generation training.
  • Better Latent Space Structure: End-to-end tuned VAEs show better latent space structure than their original counterparts, which are optimized mostly for reconstruction.

REPA-E 🤝 Canva

This work is a joint collaboration between the REPA-E team and Canva. Key findings:

  1. Better T2I Performance: End-to-end tuned VAEs lead to better T2I generation without the need for any additional representation alignment losses.
  2. End-to-End Training on ImageNet Generalizes to T2I: End-to-end VAEs tuned on just ImageNet 256×256 generalize to better T2I generation across different resolutions (256×256, 512×512).
  3. Improved Latent Space Structure: End-to-end tuned VAEs show improved semantic spatial structure and details over traditionally used VAEs such as FLUX-VAE, SD3.5-VAE, and Qwen-Image-VAE.
  4. SOTA Performance on ImageNet: End-to-end tuned VAEs achieve new state-of-the-art performance on ImageNet 256×256, reaching gFID 1.12 with classifier-free guidance.

Contents: Quickstart, Training Recipe, Quantitative T2I Results, Latent Analysis.


Quickstart

REPA-E/e2e-flux-vae

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images:

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

# Download the example image.
response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

# Convert to a (1, 3, H, W) float tensor normalized to [-1, 1].
image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to(device)

# Encode to the latent space, then decode back to pixel space.
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
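
To eyeball the reconstruction, you can map the output back from [-1, 1] to an 8-bit image. A minimal sketch continuing from the example above (the output filename is arbitrary):

# Map the reconstruction from [-1, 1] back to a uint8 HWC array and save it.
recon = ((reconstructed[0].clamp(-1, 1) + 1) * 127.5).to(torch.uint8)
Image.fromarray(recon.permute(1, 2, 0).cpu().numpy()).save("reconstruction.png")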

REPA-E/e2e-sd3.5-vae

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images:

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sd3.5-vae").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

REPA-E/e2e-qwenimage-vae

Installation: pip install "diffusers>=0.35.0" "torch>=2.5.0"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKLQwenImage

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to("cuda")

Complete Example

Full workflow for encoding and decoding images (note the frame dimension handling):

from io import BytesIO
import requests
from diffusers import AutoencoderKLQwenImage
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKLQwenImage.from_pretrained("REPA-E/e2e-qwenimage-vae").to(device)

# Add frame dimension (required for QwenImage VAE)
image_ = image.unsqueeze(2)

with torch.no_grad():
    latents = vae.encode(image_).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

# Remove frame dimension
latents = latents.squeeze(2)
reconstructed = reconstructed.squeeze(2)

REPA-E/e2e-sdvae-hf

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-sdvae-hf").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

REPA-E/e2e-vavae-hf

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-vavae-hf").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

REPA-E/e2e-invae-hf

Installation: pip install "diffusers>=0.33.0" "torch>=2.3.1"

Quick Start

Loading the VAE is as easy as:

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to("cuda")

Complete Example

Full workflow for encoding and decoding images (512×512 resolution):

from io import BytesIO
import requests
from diffusers import AutoencoderKL
import numpy as np
import torch
from PIL import Image

response = requests.get("https://raw.githubusercontent.com/End2End-Diffusion/fuse-dit/main/assets/example.png")
device = "cuda"

image = torch.from_numpy(
    np.array(
        Image.open(BytesIO(response.content)).resize((512, 512))
    )
).permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 127.5 - 1
image = image.to(device)

vae = AutoencoderKL.from_pretrained("REPA-E/e2e-invae-hf").to(device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample
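
When feeding these latents to a diffusion model (for training or inside an existing pipeline), diffusers pipelines typically normalize them with the scaling factor (and, for FLUX/SD3.5-style VAEs, the shift factor) stored in the VAE config. A minimal sketch, assuming the released checkpoints keep these standard config fields (check each model's config to confirm):

import torch
from diffusers import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to(device)

# Dummy image batch in [-1, 1]; replace with a real preprocessed image as above.
image = torch.rand(1, 3, 256, 256, device=device) * 2 - 1

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

# Normalize latents the way diffusers pipelines do before the diffusion model...
shift = getattr(vae.config, "shift_factor", None) or 0.0
latents = (latents - shift) * vae.config.scaling_factor

# ...and invert the normalization before decoding.
with torch.no_grad():
    reconstructed = vae.decode(latents / vae.config.scaling_factor + shift).sample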

End-to-End VAE Training Recipe and T2I Setup

We perform end-to-end tuning on ImageNet 256×256 to obtain end-to-end tuned encoders for popular VAE families such as FLUX-VAE, SD-3.5-VAE, and Qwen-Image-VAE. We then compare the performance of the obtained end-to-end tuned VAEs with the standard VAEs (e.g., FLUX-VAE) on text-to-image (T2I) generation tasks.

End-to-End VAE Tuning on ImageNet 256×256. We follow the same training recipe as the original REPA-E paper for end-to-end tuning on ImageNet 256×256. We use a small learning rate of 2×10⁻⁵, the AdamW optimizer, and 80 epochs of end-to-end tuning across all VAE families (FLUX-VAE, SD-3.5-VAE, Qwen-Image-VAE). We refer to the corresponding fine-tuned VAEs as E2E-FLUX-VAE, E2E-SD-3.5-VAE, and E2E-Qwen-Image-VAE, respectively. A minimal sketch of the representation-alignment loss used during this stage is shown after the configuration below.

End-to-End Tuning Configuration for VAE:

  • VAE Models: FLUX-VAE, SD-3.5-VAE, Qwen-Image-VAE
  • Dataset: ImageNet-256
  • Training epochs: 80
  • Learning rate: 2×10⁻⁵
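
For reference, the representation-alignment (REPA) loss used during this end-to-end tuning stage encourages intermediate DiT features to match a frozen vision encoder; it is not needed during the later T2I training. Below is a minimal sketch of that term alone, with illustrative shapes and a hypothetical projection head; the full REPA-E objective also includes the diffusion and VAE losses (see the original paper and code):

import torch
import torch.nn.functional as F

def repa_alignment_loss(dit_features: torch.Tensor, target_features: torch.Tensor,
                        proj: torch.nn.Module) -> torch.Tensor:
    """Negative mean cosine similarity between projected DiT patch tokens and
    features from a frozen vision encoder (e.g., DINOv2), in the spirit of REPA.

    dit_features:    (B, N, C_dit) intermediate DiT tokens
    target_features: (B, N, C_enc) frozen encoder tokens for the same images
    proj:            small MLP mapping C_dit -> C_enc, trained jointly
    """
    pred = proj(dit_features)
    return -F.cosine_similarity(pred, target_features, dim=-1).mean()

# Illustrative usage with random tensors standing in for real features.
B, N, C_dit, C_enc = 2, 256, 1152, 768
proj = torch.nn.Sequential(
    torch.nn.Linear(C_dit, C_enc), torch.nn.SiLU(), torch.nn.Linear(C_enc, C_enc)
)
loss = repa_alignment_loss(torch.randn(B, N, C_dit), torch.randn(B, N, C_enc), proj)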

T2I Training Setup. For our diffusion backbone, we follow the setup in Fuse-DiT and adopt a variant of the DiT-3B architecture with a self-attention-based text conditioning mechanism. We use Gemma-2B to encode text prompts into contextual embeddings. For training, we use the BLIP-3o pretraining dataset (~28M samples) with both original and end-to-end tuned VAEs, and run experiments at both 256×256 and 512×512 resolutions. Unless otherwise specified, we use 25 sampling steps with a guidance scale of 6.5 for inference.

T2I Training Configuration:

  • Dataset: BLIP-3o (~28M samples)
  • Resolution: 256×256, 512×512
  • Batch size: 1024 (256×256), 448 (512×512)
  • Learning rate: Constant 1×10⁻⁴
  • Optimizer: AdamW
  • EMA: Decay 0.9999, per-step update (see the sketch below)
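
For reference, a per-step EMA update of this form can be written as follows (a minimal sketch; the stand-in module is illustrative):

import copy
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.9999) -> None:
    """Per-step EMA: ema <- decay * ema + (1 - decay) * online weights."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Illustrative usage with a stand-in module for the DiT.
model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model).requires_grad_(False)
ema_update(ema_model, model)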

We next evaluate the performance of end-to-end tuned VAEs on various text-to-image generation benchmarks.


Quantitative Results: T2I Performance

We evaluate End-to-End Tuned VAEs across multiple benchmarks and training scenarios, demonstrating consistent improvements over baseline VAEs. End-to-end tuned VAEs show faster convergence and better final performance across all metrics.

End-to-End VAEs Lead to Accelerated T2I Training

We compare training with original VAEs (FLUX-VAE, SD-3.5-VAE, Qwen-Image-VAE) against their end-to-end tuned counterparts across multiple evaluation benchmarks. End-to-End Tuned VAEs consistently achieve better performance across all metrics, with improvements particularly pronounced on vision-centric benchmarks like MJHQ-30K and GenEval.

Training convergence at 100K steps. Comparison of baseline VAEs vs. End-to-End Tuned VAEs across three VAE families (FLUX-VAE, SD3.5-VAE, Qwen-Image-VAE) and five benchmarks: COCO30k FID, DPG-Bench, GenAI-Bench, GenEval, and MJHQ30k. End-to-end tuned VAEs show consistent improvements across all metrics and VAE architectures.
Training convergence at 500K steps (full data). Extended training confirms sustained improvements across all benchmarks with End-to-End Tuned VAEs.
  • Finding 1: End-to-End VAEs tuned on just ImageNet 256×256 generalize to T2I, leading to better training performance across text-to-image generation benchmarks.

Comparison with REPA Representation Alignment

To understand the effectiveness of end-to-end tuning, we compare three approaches at 100K training steps: (1) FLUX-VAE baseline without modifications, (2) FLUX-VAE with REPA representation alignment losses added during T2I training, and (3) E2E-FLUX-VAE (ours) with end-to-end tuning but without additional alignment losses. The results demonstrate that end-to-end tuning outperforms both the baseline and REPA-enhanced approaches across all benchmarks, achieving superior performance without requiring auxiliary alignment objectives.

Performance comparison at 100K steps: E2E-FLUX-VAE vs. REPA alignment. Bar chart comparing three approaches across five benchmarks. E2E-FLUX-VAE (red) outperforms both the FLUX-VAE baseline (blue) and FLUX-VAE+REPA with representation alignment (orange), demonstrating that end-to-end tuned VAEs lead to better T2I generation without the need for any additional representation alignment losses.
  • Finding 2: End-to-End VAEs lead to faster T2I training than baseline REPA, without the need for any additional representation alignment losses.

End-to-End VAEs Generalize to Higher Resolutions

We also analyze the generalization of end-to-end tuned VAEs to higher resolutions. We resume training from the 500K checkpoint trained at 256×256 resolution and train for an additional 200K steps at 512×512 resolution with batch size 448. We observe that, despite being tuned at 256×256 on ImageNet, end-to-end tuned VAEs continue to outperform the original VAEs across all benchmarks even when the T2I model is trained at 512×512 resolution.

High-resolution training (512px, 200K steps): final performance. Bar chart showing the final performance comparison between FLUX-VAE and E2E-FLUX-VAE after 200K additional steps at 512×512 resolution. Performance improvements persist when resuming training at higher resolution, demonstrating that E2E-tuned VAEs generalize effectively across resolutions.
  • Finding 3: End-to-End VAEs tuned on ImageNet 256×256 generalize to better T2I generation across different resolutions (256×256, 512×512).

Qualitative Comparisons

For qualitative visualization, we use the 200K-step checkpoint trained at 512×512 resolution. All images are generated with 25 sampling steps and a guidance scale of 6.5. Beyond quantitative metrics, End-to-End Tuned VAEs produce visually superior results compared to baseline FLUX-VAE. The generated images show improved detail, better prompt adherence, and more coherent compositions. Below we show comparisons for T2I generations using models trained with FLUX-VAE and E2E-Tuned FLUX-VAE (Ours).

Qualitative comparison samples (200K steps, 512px resolution). Models trained with End-to-End Tuned VAEs show improved quality and prompt adherence across diverse prompts compared to FLUX-VAE.

Impact of End-to-End Tuning on VAE Reconstruction Quality

We also analyze the impact of end-to-end tuning on VAE reconstruction quality. Notably, despite being tuned only on ImageNet 256×256, end-to-end tuned VAEs show improved generation quality while maintaining reconstruction fidelity across challenging scenes with multiple faces, subjects, and text.

Reconstruction results across all E2E-tuned VAEs (example scenes: a cat with a sign, children with a birthday sign, a textbook study scene, and a group of children posing). End-to-End Tuned VAEs show improved generation quality while maintaining reconstruction fidelity across challenging scenes with multiple faces, subjects, and text.
  • Finding 4: End-to-End VAEs, despite being tuned only on ImageNet 256×256, improve generation performance while maintaining reconstruction fidelity across challenging scenes with multiple faces, subjects, and text.

End-to-End Tuned VAEs Lead to Better Latent Space Structure

To understand what makes End-to-End Tuned VAEs effective, we analyze the learned latent representations through PCA projections and spatial similarity analysis. These visualizations reveal how end-to-end tuning shapes the VAE's latent space to better support high-quality generation. Notably, end-to-end tuning enriches the VAE latent space by incorporating more structural and semantic information than traditional VAEs (FLUX-VAE, SD-VAE) trained for reconstruction alone. Please also refer to the original REPA-E paper for more details.

PCA Projections of VAE Latents

We project VAE latent representations to 2D using PCA and visualize them as RGB images (first 3 principal components). This reveals the spatial structure and semantic organization learned by different VAE architectures. As illustrated in the PCA visualizations, end-to-end training injects additional structural and semantic information into the latent representations.

PCA projection visualization. Comparison of latent space structure between the baseline FLUX-VAE and E2E-tuned FLUX-VAE. Colors represent the first 3 principal components, depicted as RGB for visualization. We observe that the E2E-tuned FLUX-VAE shows more semantic and structural information in the latent space compared to the baseline FLUX-VAE.
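
A minimal sketch of how such a visualization can be produced from a single latent (the helper name is ours; it expects a (C, H, W) latent such as latents[0] from the quickstart example):

import numpy as np
import torch

def latent_to_pca_rgb(latent: torch.Tensor) -> np.ndarray:
    """Project a (C, H, W) latent onto its first 3 principal components and
    rescale each component to [0, 1] so it can be viewed as an (H, W, 3) RGB image."""
    c, h, w = latent.shape
    x = latent.reshape(c, h * w).T.float().cpu().numpy()  # (H*W, C) patch tokens
    x = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x, full_matrices=False)      # PCA via SVD
    rgb = x @ vt[:3].T                                     # (H*W, 3)
    rgb = (rgb - rgb.min(axis=0)) / (rgb.max(axis=0) - rgb.min(axis=0) + 1e-8)
    return rgb.reshape(h, w, 3)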

Spatial Self-Similarity Analysis

We also analyze the cosine similarity between patch tokens in the latent space to measure spatial structure. End-to-End Tuned VAEs show more coherent spatial patterns, indicating better capture of local and global image structure. As shown in the similarity maps, end-to-end tuning embeds more meaningful structural and semantic relationships between patches, making the latent space more informative for the diffusion model to generate high-quality images.

Spatial self-similarity heatmaps. Cosine similarity between latent patch tokens reveals spatial structure. E2E-Tuned FLUX-VAE shows more coherent patterns indicating better structural representation.
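
A minimal sketch of the patch-to-patch cosine similarity behind such maps (the helper name and the choice of reference patch are ours):

import torch
import torch.nn.functional as F

def spatial_self_similarity(latent: torch.Tensor, ref_idx: int = 0) -> torch.Tensor:
    """Cosine similarity between one reference latent patch and every other patch.

    latent: (C, H, W) VAE latent. Returns an (H, W) similarity map in [-1, 1]."""
    c, h, w = latent.shape
    tokens = F.normalize(latent.reshape(c, h * w).T, dim=-1)  # (H*W, C), unit-norm
    sim = tokens @ tokens[ref_idx]                             # (H*W,)
    return sim.reshape(h, w)
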
  • Finding 5: End-to-End tuned VAEs show improved semantic spatial structure and details over FLUX-VAE, as evidenced by PCA projections and spatial self-similarity analysis.

ImageNet Generalization

Finally, we show that the end-to-end tuned VAEs can also be used for traditional image generation benchmarks such as ImageNet 256×256. We observe that the end-to-end tuned VAEs achieve significantly better FID scores (i.e., better generation quality) than their original counterparts across all VAE families (FLUX, SD3.5, Qwen).

ImageNet Results. Comparison of standard VAEs vs. E2E-tuned VAEs across the FLUX, Qwen, and SD3.5 architectures. End-to-end tuned VAEs achieve significantly better gFID scores.

Limitations and Future Work

In this work, we show how end-to-end tuned VAEs lead to better T2I training than standard VAEs such as FLUX-VAE and SD-VAE, which are primarily trained for reconstruction alone. This improvement primarily arises because end-to-end tuned VAEs learn more semantic latent representations while maintaining strong reconstruction fidelity.

In future work, we are actively studying the impact of end-to-end tuned VAEs on other downstream tasks that also rely on both the semantic and reconstruction ability of the VAE latents, such as image-to-image translation, image editing, and image inpainting.

Conclusion

We present REPA-E for T2I, a family of End-to-End Tuned VAEs for supercharging text-to-image generation training. End-to-end tuned VAEs show superior performance over their original counterparts across all benchmarks (COCO30k, DPG-Bench, GenAI-Bench, GenEval, MJHQ30k) without the need for any additional representation alignment losses.

We hope REPA-E for T2I will inspire further research into end-to-end training strategies for generative models and the co-design of VAE/RAE architectures with diffusion transformers.

Team

This work is a joint collaboration between the REPA-E team and Canva.

REPA-E Team: Xingjian Leng*, Jaskirat Singh*, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng

Canva Team: Ryan Murdock, Ethan Smith, Rebecca Li

BibTeX

@article{leng2025repae,
  title={{REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers}},
  author={Leng, Xingjian and Singh, Jaskirat and Hou, Yunzhong and Xing, Zhenchang and Xie, Saining and Zheng, Liang},
  journal={arXiv preprint arXiv:2504.10483},
  year={2025}
}