DiffusionBench

On Holistic Evaluation of Diffusion Transformers

with a Unified Training Framework Bridging ImageNet and Text-to-Image

Xingjian Leng1,2, Jaskirat Singh1, Zhanhao Liang1, Ethan Smith2, Martin Bell2, Aninda Saha2, Yuhui Yuan2, Liang Zheng1,2
1Australian National University    2Canva Research

TL;DR: NanoGen is a unified framework that trains and evaluates diffusion transformers across ImageNet and text-to-image with only roughly 12 lines of config change. We use it to show that method ranking shows no strong correlation between ImageNet and T2I generation, and introduce DiffusionBench, a holistic benchmark for DiT research.

Matching State-of-the-Art Performance: The unified NanoGen training framework matches state-of-the-art DiT methods on ImageNet.
Effortless ImageNet → T2I: Going from ImageNet to text-to-image takes only roughly 12 lines of config change.
Holistic Evaluation: A systematic comparison of 25 methods across ImageNet and T2I under various metrics.
############################################################################
#                                                                            #
#   ____  _  __  __           _                            .-----------.     #
#  |  _ \(_)/ _|/ _|_   _ ___(_) ___  _ __                 |           |     #
#  | | | | | |_| |_| | | / __| |/ _ \| '_ \                | ░▒▓█▓▒░▒▓ |     #
#  | |_| | |  _|  _| |_| \__ \ | (_) | | | |               | ▒▓█████▓▒ |     #
#  |____/|_|_| |_|  \__,_|___/_|\___/|_| |_|               | ▓███████▓ |     #
#                                                          |     ↓     |     #
#   ____                  _                                | █████████ |     #
#  | __ )  ___ _ __   ___| |__                             | ▓███████▓ |     #
#  |  _ \ / _ \ '_ \ / __| '_ \                            | ▒▓█████▓▒ |     #
#  | |_) |  __/ | | | (__| | | |                           |           |     #
#  |____/ \___|_| |_|\___|_| |_|                           '-----------'     #
#                                                                            #
#           Because ImageNet evaluation alone is no longer enough!           #
#                                                                            #
############################################################################

Overview

Diffusion transformer (DiT) research on image generation has converged on a single evaluation setup: class-conditional generation on ImageNet. While methods keep improving FID and related metrics, it is increasingly unclear whether these gains reflect real progress in generative modeling. The natural alternative, text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate, and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified training and evaluation framework that matches state-of-the-art DiT baselines on ImageNet and, with roughly 12 lines of configuration change, also trains competitive text-to-image models.

  1. §A Unified Framework: We release NanoGen, a unified DiT training framework that matches state-of-the-art methods on ImageNet and extends to text-to-image training with roughly 12 lines of config changes.
  2. §ImageNet Rankings Don't Reliably Predict T2I: We observe that method rankings on ImageNet and on T2I generation are not strongly correlated, and the discrepancies are large enough to flip conclusions drawn from ImageNet alone.
  3. §A Holistic Benchmark: We incorporate both tasks into DiffusionBench, present results across many existing methods, and argue for its adoption as a default DiT benchmark.

Quickstart

NanoGen supports training and evaluation across tasks (ImageNet, T2I, ...) through a single interface. Below are the essentials; full details are in the GitHub repo.

Switching from ImageNet to text-to-image is a config change, not a code change — roughly 12 lines. Only two blocks move: the conditioning module and the dataset. The backbone, optimiser, training loop, and evaluation harness stay identical.

ImageNet (imagenet.yaml)

# stage_1, stage_2, transport, sampler,
# training, misc ... shared across tasks
conditioning:
  # class label
  type: "label"
  cfg_dropout_prob: 0.1
  arch:
    num_t_tokens: 4
    num_c_tokens: 8
  # no text encoder for class conditioning
dataset:
  target: "imagenet"
  type: "hf"
  data_dir: "./data/imagenet"
  split: "train"
  condition_type: "label"

Text-to-Image (t2i.yaml)

# stage_1, stage_2, transport, sampler,
# training, misc ... shared across tasks
conditioning:
  # text prompt
  type: "text"
  cfg_dropout_prob: 0.1
  arch:
    num_t_tokens: 4
    # num_c_tokens: 8
  text_encoder:
    model_name: "Qwen/Qwen3-0.6B"
    max_length: 256
dataset:
  target: "blip3o"
  type: "wds"
  data_dir: "./data/blip3o-256"
  split: ["journeydb", "long-caption"]
  condition_type: "text"

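To make the "config change, not a code change" claim concrete, the sketch below diffs the two configs shown above. It is only an illustration: the dictionaries are hand-copied from the YAML snippets, the `training` block is a stand-in for the shared sections (its values are taken from the training recipe), and `flatten`, `diff_configs`, and `MISSING` are hypothetical helpers, not part of NanoGen.

```python
# Hypothetical helper code; the dicts mirror the two YAML snippets on this page.
MISSING = "<absent>"

# Stand-in for the sections shared across tasks (values from the training recipe).
SHARED = {"training": {"batch_size": 1024, "iterations": 100_000}}

IMAGENET = {
    **SHARED,
    "conditioning": {
        "type": "label",
        "cfg_dropout_prob": 0.1,
        "arch": {"num_t_tokens": 4, "num_c_tokens": 8},
    },
    "dataset": {
        "target": "imagenet",
        "type": "hf",
        "data_dir": "./data/imagenet",
        "split": "train",
        "condition_type": "label",
    },
}

T2I = {
    **SHARED,
    "conditioning": {
        "type": "text",
        "cfg_dropout_prob": 0.1,
        "arch": {"num_t_tokens": 4},
        "text_encoder": {"model_name": "Qwen/Qwen3-0.6B", "max_length": 256},
    },
    "dataset": {
        "target": "blip3o",
        "type": "wds",
        "data_dir": "./data/blip3o-256",
        "split": ["journeydb", "long-caption"],
        "condition_type": "text",
    },
}


def flatten(cfg, prefix=""):
    """Flatten a nested dict into {"a.b.c": value} form."""
    flat = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(a, b):
    """Return {dotted_key: (value_in_a, value_in_b)} for every differing leaf."""
    flat_a, flat_b = flatten(a), flatten(b)
    changes = {}
    for key in sorted(flat_a.keys() | flat_b.keys()):
        old, new = flat_a.get(key, MISSING), flat_b.get(key, MISSING)
        if old != new:
            changes[key] = (old, new)
    return changes


changes = diff_configs(IMAGENET, T2I)
changed_blocks = {key.split(".")[0] for key in changes}

for key, (old, new) in changes.items():
    print(f"{key}: {old!r} -> {new!r}")
print("changed top-level blocks:", sorted(changed_blocks))
```

Every differing key falls under `conditioning` or `dataset`, which is the property the text describes; the shared stand-in block never shows up in the diff.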
# install uv project manager (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# install dependencies
uv sync

# prepare data
uv run python scripts/prepare.py --data {all,imagenet,t2i,eval}

# download pretrained models
uv run hf download diffusion-bench/diffusion-bench --local-dir pretrained_models --exclude .gitattributes

Reproduction flow: Stage 1 → Stage 2. Set these environment variables first (used for the output directory and W&B logging):

export EXPERIMENT_NAME=<run-name>
export ENTITY=<wandb-entity>
export PROJECT=<wandb-project>
export WANDB_KEY=<key>

Stage 1 — train the RAE tokenizer

uv run torchrun --standalone --nproc_per_node=8 \
    src/train_stage1.py \
    --config [STAGE1_CONFIG_PATH] \
    --results-dir results/stage1 --precision bf16 --compile --wandb

Stage 2 — train the diffusion model (VAE / RAE / Pixel)

uv run torchrun --standalone --nproc_per_node=8 \
    src/train.py \
    --config [STAGE2_CONFIG_PATH] \
    --results-dir results/stage2 --precision bf16 --compile --wandb

Stage 2 training configs run online evaluation during training. For standalone evaluation of a released checkpoint, use the sampling/ configs — each embeds the trained checkpoint and eval-time guidance, so the weights load automatically:

export EXPERIMENT_NAME=<run-name>

# stage 1 reconstruction (rFID / PSNR / SSIM / LPIPS)
uv run torchrun --nproc_per_node=8 src/offline_eval_stage1.py --config [STAGE1_CONFIG_PATH]

# stage 2 generation (FID / IS, GenEval / DPGBench / ...)
uv run torchrun --nproc_per_node=8 src/offline_eval.py --config [STAGE2_CONFIG_PATH]

The NanoGen Training and Evaluation Framework

NanoGen is a diffusion model training and evaluation framework that supports both class-conditional ImageNet generation and text-to-image generation under a single codebase. Its goal is to make the additional cost of evaluating a method on the T2I task as low as possible.

Design principles.

Across ImageNet & T2I, we use the same backbone, optimiser, training loop, evaluation harness, and config format.
Switching between tasks requires only two changes: the dataset and the conditioning module.

Backbone architecture. A standard diffusion transformer with three deliberate modifications: a Decoupled Diffusion Transformer (DDT) backbone whose encoder-decoder split increases effective width without the quadratic FLOPs cost of a uniformly wide DiT; no AdaLN in the encoder; and in-context conditioning, feeding all conditioning (including the timestep) as tokens, so adapting to a new task just means changing those tokens.

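Task-specific conditioning tokens. The only per-task difference is the set of conditioning tokens: 4 timestep + 8 class tokens for ImageNet, and 4 timestep + 256 text tokens from a frozen Qwen3-0.6B encoder for text-to-image.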
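As a rough illustration of in-context conditioning, the sketch below lays out one input sequence per task as token types. The conditioning token counts come from the training recipe on this page; the number of image tokens (256) and the ordering (conditioning tokens first) are assumptions made for the example, and `ConditioningSpec` and `build_layout` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditioningSpec:
    """Number of in-context conditioning tokens used by one task."""
    num_t_tokens: int  # timestep tokens
    num_c_tokens: int  # class or text tokens

    @property
    def total(self) -> int:
        return self.num_t_tokens + self.num_c_tokens


# Token counts as listed in the training recipe.
IMAGENET = ConditioningSpec(num_t_tokens=4, num_c_tokens=8)
T2I = ConditioningSpec(num_t_tokens=4, num_c_tokens=256)

# Assumption: 256 image tokens (for example a 16x16 grid); not stated on this page.
NUM_IMAGE_TOKENS = 256


def build_layout(spec, num_image_tokens=NUM_IMAGE_TOKENS):
    """Return the token-type layout of one sequence: timestep, condition, image."""
    return (
        ["t"] * spec.num_t_tokens
        + ["c"] * spec.num_c_tokens
        + ["x"] * num_image_tokens
    )


for name, spec in (("imagenet", IMAGENET), ("t2i", T2I)):
    layout = build_layout(spec)
    print(name, "conditioning tokens:", spec.total, "sequence length:", len(layout))
```

Under these assumptions the backbone sees the same kind of sequence in both tasks; only the number and source of the `c` tokens change.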

Training recipe.

Setup
  • Training
    • Resolution: 256×256
    • Iterations: 100K
    • Batch size: 1024
    • Optimizer: AdamW, learning rate 2×10⁻⁴ with linear decay
    • Sampler: 50-step Euler
  • ImageNet
    • Conditioning: 4 timestep + 8 class tokens
    • Guidance: without CFG / per-method best CFG (applied on the t-interval [0, 0.9])
  • Text-to-image
    • Data: BLIP-3o (JourneyDB, Long-Caption, and Short-Caption splits)
    • Conditioning: 4 timestep + 256 text tokens from a frozen Qwen3-0.6B encoder
    • Guidance: classifier-free guidance scale 6.0
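The sampler and guidance settings above can be sketched with a one-dimensional toy model. This is a minimal sketch, not NanoGen's sampler: the velocity field is a made-up straight-line flow, the time convention (integrating t from 0 to 1), the inclusive interval endpoints, and the placeholder guidance scale of 2.0 are all assumptions, and `toy_velocity`, `guided_velocity`, and `euler_sample` are hypothetical names. Only the 50 Euler steps, the standard classifier-free guidance formula, and the [0, 0.9] interval are taken from the setup.

```python
def guided_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance: extrapolate from uncond toward cond."""
    return v_uncond + scale * (v_cond - v_uncond)


def toy_velocity(x, t, target):
    """Toy 1-D velocity field that moves x toward `target` along a straight path."""
    return (target - x) / (1.0 - t)


def euler_sample(x0, cond_target, uncond_target=0.0, num_steps=50,
                 cfg_scale=2.0, cfg_interval=(0.0, 0.9)):
    """Integrate from t=0 to t=1 with Euler steps; apply CFG only inside cfg_interval.

    Returns the final sample and the number of velocity evaluations (NFE).
    """
    x, nfe = x0, 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i / num_steps
        v_cond = toy_velocity(x, t, cond_target)
        nfe += 1
        if cfg_interval[0] <= t <= cfg_interval[1]:
            # Guided step: needs a second (unconditional) evaluation.
            v_uncond = toy_velocity(x, t, uncond_target)
            nfe += 1
            v = guided_velocity(v_cond, v_uncond, cfg_scale)
        else:
            # Outside the interval: plain conditional velocity.
            v = v_cond
        x = x + dt * v
    return x, nfe


sample, nfe = euler_sample(x0=-1.5, cond_target=2.0)
print("final sample:", sample, "velocity evaluations:", nfe)
```

Restricting guidance to an interval saves the unconditional evaluation on the remaining steps, so the evaluation count falls between one and two per step.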

ImageNet Reproducibility Validation

We first confirm that NanoGen establishes trustworthy ImageNet baselines, i.e., that our implementations match the numbers reported in the original papers. We re-implement and re-train six existing methods: three latent-space methods (RAE and two E2E-VAE variants) and three pixel-space methods (PixNerd, JiT, PixelGen). The NanoGen results are competitive with the published numbers and sometimes slightly better.

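| Method | Epochs | #Params | Prediction | NFE | with Guidance | FID ↓ | IS ↑ |
|---|---|---|---|---|---|---|---|
| Latent-space | | | | | | | |
| RAE (DINOv2-B) | 80 | 839M | v | 50 | × | 2.16 | 214.8 |
| Ours | 80 | 847M | v | 50 | × | 2.07 | 213.5 |
| E2E-VAVAE | 80 | 675M | v | 250 | × | 5.26 | |
| Ours | 80 | 680M | v | 250 | × | 3.64 | 152.5 |
| E2E-VAVAE + REPA | 80 | 675M | v | 250 | × | 3.46 | 159.8 |
| Ours | 80 | 681M | v | 250 | × | 2.88 | 165.4 |
| Pixel-space | | | | | | | |
| PixNerd | 160 | 458M | v | 100 | | 2.64 | 297.0 |
| Ours | 160 | 446M | v | 100 | | 2.58 | 299.3 |
| JiT | 200 | 131M | x | 50 | | 8.62 | |
| Ours | 200 | 88M | x | 50 | | 5.49 | 231.6 |
| PixelGen | 40 | 459M | x | 50 | × | 7.53 | 131.7 |
| Ours | 40 | 458M | x | 50 | × | 7.52 | 123.5 |

ImageNet-256 reproducibility across latent-space and pixel-space methods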
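As a quick check of the reproducibility claim, the sketch below computes the relative FID change between each published number and the NanoGen re-implementation. The FID values are copied from the table above; `relative_change` and the variable names are only illustrative.

```python
# (method, reported FID, NanoGen FID), copied from the reproducibility table.
RESULTS = [
    ("RAE (DINOv2-B)", 2.16, 2.07),
    ("E2E-VAVAE", 5.26, 3.64),
    ("E2E-VAVAE + REPA", 3.46, 2.88),
    ("PixNerd", 2.64, 2.58),
    ("JiT", 8.62, 5.49),
    ("PixelGen", 7.53, 7.52),
]


def relative_change(reported, ours):
    """Signed relative change; negative means the re-implementation has lower FID."""
    return (ours - reported) / reported


for method, reported, ours in RESULTS:
    change = relative_change(reported, ours)
    print(f"{method:<18} {reported:>5.2f} -> {ours:>5.2f} ({change:+.1%})")

# Methods whose re-implementation matches or improves on the reported FID.
reproduced = [method for method, reported, ours in RESULTS if ours <= reported]
print(f"{len(reproduced)}/{len(RESULTS)} methods match or improve the reported FID")
```

Lower FID is better, so a negative change means the re-trained model is at least as good as the published one on this metric; IS is not covered by this check.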

DiffusionBench: A Holistic Benchmark

ImageNet generation. The best FID (1.37) is achieved by FLUX.2-VAE, followed by the end-to-end REPA-E VAEs at around 1.5–1.6. The RAE family is slightly higher, with the better ones around 1.7–1.9 (DINOv3-B is the best RAE at 1.74). Traditional VAEs such as SD-VAE and SD3.5-VAE, along with the pixel-space methods, lag behind, though at 80 epochs this gap is largely driven by convergence speed and should narrow with longer training. Results are reported with and without CFG, and under both distributional metrics (FDr and MIND).

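Each cell lists the value with CFG / without CFG. The last column holds the ten FDr ↓ and MIND ↓ entries, computed over Inception, ConvNeXt, DINOv2, MAE, and SigLIP features, in source order.

| Method | FID ↓ | IS ↑ | FDr ↓ / MIND ↓ |
|---|---|---|---|
| Latent-space (RAE) | | | |
| DINOv2-B | 1.96 / 2.14 | 224.1 / 211.9 | 1.22 / 1.32, 2.03 / 2.54, 2.20 / 2.33, 66.49 / 70.53, 3.26 / 3.14, 27.74 / 26.52, 6.19 / 6.33, 0.43 / 0.44, 7.76 / 7.56, 6.52 / 6.30 |
| DINOv2-B + REG | 1.84 / 2.08 | 236.2 / 207.7 | 1.15 / 1.29, 1.90 / 2.23, 2.15 / 2.45, 64.30 / 76.20, 3.21 / 3.26, 27.40 / 27.98, 6.43 / 6.42, 0.44 / 0.45, 7.71 / 7.82, 6.43 / 6.47 |
| DINOv3-B | 1.74 / 2.15 | 244.2 / 200.8 | 1.09 / 1.33, 1.80 / 2.16, 2.10 / 2.82, 65.83 / 97.15, 3.30 / 3.89, 29.35 / 33.62, 6.53 / 6.80, 0.43 / 0.45, 7.66 / 8.17, 6.03 / 6.45 |
| DINOv3-B + REG | 1.78 / 2.15 | 248.1 / 204.4 | 1.11 / 1.32, 1.88 / 2.35, 2.12 / 2.71, 65.81 / 90.39, 3.41 / 3.79, 29.97 / 32.33, 6.54 / 6.71, 0.43 / 0.45, 7.88 / 8.11, 6.17 / 6.38 |
| SigLIP2-B | 2.61 / 3.48 | 222.9 / 179.4 | 1.59 / 2.11, 1.86 / 2.45, 3.01 / 4.20, 97.91 / 162.71, 7.33 / 8.55, 76.01 / 85.81, 10.61 / 10.97, 0.87 / 0.90, 11.38 / 11.76, 9.89 / 10.15 |
| PE-L | 2.84 / 3.08 | 221.5 / 206.6 | 1.72 / 1.86, 2.86 / 3.05, 2.80 / 3.15, 90.07 / 97.69, 5.91 / 6.17, 56.80 / 58.40, 10.90 / 11.01, 0.88 / 0.89, 9.88 / 9.87, 8.19 / 8.08 |
| LangPE-L | 2.46 / 2.76 | 196.7 / 182.2 | 1.48 / 1.65, 2.12 / 2.53, 2.99 / 3.48, 108.83 / 138.39, 6.27 / 6.66, 61.86 / 64.01, 8.87 / 8.97, 0.68 / 0.69, 10.24 / 10.28, 9.03 / 9.07 |
| SpatialPE-L | 1.86 / 3.61 | 247.1 / 160.4 | 1.16 / 2.18, 1.35 / 2.35, 1.82 / 4.17, 55.46 / 206.52, 4.67 / 7.30, 45.79 / 69.92, 6.62 / 7.01, 0.51 / 0.54, 8.57 / 10.74, 7.23 / 9.16 |
| Latent-space (VAE) | | | |
| SD-VAE-EMA | 2.43 / 10.16 | 259.6 / 113.4 | 1.38 / 5.97, 2.92 / 9.56, 1.92 / 10.14, 48.52 / 688.82, 7.71 / 18.17, 87.44 / 245.95, 6.94 / 11.11, 0.62 / 0.99, 19.15 / 35.43, 20.39 / 38.87 |
| SD-VAE-EMA + REG | 2.34 / 5.39 | 271.6 / 162.7 | 1.32 / 3.14, 2.04 / 3.45, 1.74 / 5.14, 55.02 / 246.05, 7.24 / 11.30, 84.72 / 145.13, 7.55 / 9.58, 0.69 / 0.87, 18.47 / 25.63, 20.14 / 28.41 |
| SD-VAE-MSE | 2.56 / 10.15 | 259.7 / 112.4 | 1.45 / 5.97, 3.10 / 9.34, 2.15 / 10.36, 55.66 / 676.48, 7.61 / 17.84, 86.97 / 243.23, 7.82 / 12.03, 0.70 / 1.07, 21.82 / 38.19, 25.01 / 43.99 |
| SDXL-VAE | 3.07 / 12.88 | 256.0 / 104.7 | 1.69 / 7.50, 3.38 / 12.96, 2.74 / 12.58, 70.85 / 858.54, 9.26 / 21.50, 109.22 / 297.59, 8.64 / 13.57, 0.75 / 1.21, 21.55 / 40.39, 22.66 / 44.20 |
| SD3.5-VAE | 2.64 / 10.18 | 262.9 / 111.7 | 1.51 / 5.96, 3.04 / 8.50, 2.14 / 10.21, 75.53 / 634.85, 7.76 / 17.97, 89.20 / 240.52, 6.19 / 10.39, 0.55 / 0.96, 15.16 / 30.85, 14.04 / 30.70 |
| FLUX.1-VAE | 3.55 / 15.75 | 245.7 / 86.6 | 2.04 / 9.28, 3.37 / 14.81, 3.89 / 16.66, 107.70 / 1072.69, 9.25 / 23.27, 105.18 / 316.55, 8.19 / 14.02, 0.82 / 1.41, 19.54 / 41.46, 18.62 / 42.58 |
| FLUX.2-VAE | 1.37 / 4.53 | 272.7 / 146.9 | 0.89 / 2.76, 0.90 / 2.53, 1.07 / 4.65, 24.98 / 234.79, 4.32 / 8.56, 43.59 / 96.75, 3.90 / 5.15, 0.31 / 0.41, 10.75 / 17.41, 9.84 / 16.46 |
| FLUX.2-VAE + REG | 1.44 / 4.19 | 294.1 / 155.8 | 0.92 / 2.56, 1.06 / 2.11, 0.95 / 4.06, 37.56 / 183.74, 4.27 / 7.91, 42.47 / 88.65, 3.91 / 5.17, 0.31 / 0.41, 10.17 / 16.74, 9.42 / 15.82 |
| Qwen-Image-VAE | 3.01 / 10.86 | 238.9 / 108.9 | 1.85 / 6.52, 2.75 / 9.34, 4.56 / 13.25, 159.26 / 804.53, 9.77 / 19.78, 118.42 / 270.50, 9.19 / 13.45, 0.90 / 1.31, 25.46 / 41.51, 25.97 / 44.01 |
| E2E-VAVAE | 1.65 / 4.27 | 275.4 / 147.6 | 1.08 / 2.64, 2.13 / 2.20, 1.99 / 5.90, 62.17 / 277.26, 4.51 / 8.59, 50.75 / 103.04, 4.71 / 6.33, 0.36 / 0.51, 9.83 / 16.30, 9.17 / 15.54 |
| E2E-FLUX.1-VAE | 1.67 / 6.30 | 266.3 / 134.3 | 1.07 / 3.83, 1.16 / 4.23, 1.83 / 6.89, 50.73 / 373.50, 5.12 / 10.92, 53.43 / 129.48, 4.68 / 6.67, 0.36 / 0.53, 11.76 / 21.28, 10.75 / 20.39 |
| E2E-SD3.5-VAE | 1.62 / 5.32 | 265.4 / 140.8 | 1.16 / 3.37, 1.10 / 3.19, 1.30 / 5.10, 42.14 / 266.87, 4.57 / 9.21, 48.18 / 108.44, 5.49 / 7.07, 0.42 / 0.55, 11.12 / 18.70, 10.25 / 17.69 |
| E2E-Qwen-Image-VAE | 1.55 / 4.98 | 261.4 / 138.4 | 1.06 / 3.11, 1.54 / 2.84, 2.26 / 6.80, 61.19 / 337.95, 4.57 / 9.57, 48.81 / 113.68, 4.63 / 6.46, 0.37 / 0.53, 11.33 / 19.85, 10.24 / 18.88 |
| Pixel-space | | | |
| JiT | 4.08 / 21.72 | 231.2 / 65.0 | 2.38 / 12.82, 3.97 / 24.19, 4.59 / 22.87, 146.37 / 1632.56, 9.57 / 28.16, 113.63 / 412.33, 13.70 / 23.59, 1.23 / 2.15, 22.51 / 53.76, 21.21 / 55.40 |
| PixNerd | 4.17 / 20.61 | 213.8 / 63.9 | 2.45 / 12.18, 4.24 / 22.77, 4.01 / 21.42, 104.21 / 1581.55, 8.33 / 25.08, 86.10 / 345.19, 11.67 / 19.67, 0.96 / 1.71, 20.65 / 48.24, 18.94 / 49.33 |
| PixelGen | 3.97 / 12.10 | 247.4 / 104.0 | 2.26 / 7.01, 3.96 / 9.07, 4.34 / 13.93, 138.33 / 769.26, 7.80 / 17.89, 86.50 / 222.02, 13.14 / 18.98, 1.12 / 1.68, 17.97 / 34.49, 15.73 / 32.35 |
| One-/Few-step | | | |
| MeanFlow (NFE=1) | 6.60 / 24.83 | 206.7 / 61.4 | 3.71 / 14.56, 6.19 / 31.56, 4.71 / 24.09, 116.13 / 1993.71, 17.35 / 37.39, 203.45 / 571.19, 15.54 / 21.15, 1.35 / 1.88, 43.69 / 70.61, 49.72 / 81.85 |
| MeanFlow (NFE=2) | 5.40 / 20.58 | 226.5 / 63.0 | 3.19 / 12.24, 7.93 / 24.85, 3.25 / 21.97, 69.05 / 1860.67, 13.19 / 34.62, 148.75 / 524.10, 9.84 / 16.23, 0.83 / 1.42, 28.42 / 58.31, 27.45 / 62.69 |

Systematic comparison on ImageNet-256.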
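The 80-epoch budget mentioned for the ImageNet comparison is consistent with the shared training recipe (100K iterations at batch size 1024). The short calculation below shows the arithmetic; the only number not taken from this page is the size of the ImageNet-1k training split (1,281,167 images), and `epochs_seen` is a hypothetical helper name.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # size of the ImageNet-1k training split
ITERATIONS = 100_000               # from the training recipe
BATCH_SIZE = 1024                  # from the training recipe


def epochs_seen(iterations, batch_size, dataset_size):
    """Number of passes over the dataset for a given iteration budget."""
    return iterations * batch_size / dataset_size


epochs = epochs_seen(ITERATIONS, BATCH_SIZE, IMAGENET_TRAIN_IMAGES)
print(f"{epochs:.1f} epochs")  # prints "79.9 epochs", i.e. roughly 80
```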

Text-to-image generation. We train the same methods as text-to-image models and score them with GenEval, DPG-Bench, and GenAIBench. Public T2I models are shown for reference.

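| Method | Iters | #Params | GenEval ↑ | DPG-Bench ↑ | GenAIBench ↑ |
|---|---|---|---|---|---|
| Public models (reference) | | | | | |
| SD-3.5-Large | | 8B | 0.691 | 0.842 | 0.767 |
| FLUX-1 | | 12B | 0.654 | 0.838 | 0.748 |
| FLUX-2 | | 32B | 0.854 | 0.870 | 0.841 |
| Qwen-Image | | 20B | 0.848 | 0.888 | 0.803 |
| Z-Image-Turbo | | 6B | 0.736 | 0.847 | 0.759 |
| Latent-space (RAE) | | | | | |
| DINOv2-B | 100K | 615M | 0.628 | 0.810 | 0.707 |
| DINOv2-B + REG | 100K | 619M | 0.608 | 0.808 | 0.702 |
| DINOv3-B | 100K | 615M | 0.636 | 0.828 | 0.718 |
| DINOv3-B + REG | 100K | 619M | 0.642 | 0.827 | 0.730 |
| SigLIP2-B | 100K | 615M | 0.606 | 0.809 | 0.718 |
| PE-L | 100K | 617M | 0.586 | 0.818 | 0.723 |
| SpatialPE-L | 100K | 617M | 0.535 | 0.790 | 0.694 |
| LangPE-L | 100K | 617M | 0.633 | 0.826 | 0.724 |
| LangPE-L | 200K | 617M | 0.635 | 0.824 | 0.715 |
| Latent-space (VAE) | | | | | |
| SD-VAE-EMA | 100K | 611M | 0.578 | 0.804 | 0.691 |
| SD-VAE-EMA + REG | 100K | 615M | 0.570 | 0.792 | 0.691 |
| SD-VAE-MSE | 100K | 611M | 0.624 | 0.813 | 0.701 |
| SDXL-VAE | 100K | 611M | 0.617 | 0.812 | 0.705 |
| SD3.5-VAE | 100K | 612M | 0.640 | 0.818 | 0.702 |
| Qwen-Image-VAE | 100K | 612M | 0.611 | 0.802 | 0.704 |
| E2E-FLUX.1-VAE | 100K | 612M | 0.625 | 0.823 | 0.706 |
| E2E-SD3.5-VAE | 100K | 612M | 0.637 | 0.840 | 0.715 |
| E2E-Qwen-Image-VAE | 100K | 612M | 0.691 | 0.835 | 0.714 |
| FLUX.1-VAE | 100K | 612M | 0.559 | 0.796 | 0.684 |
| FLUX.1-VAE | 200K | 612M | 0.544 | 0.816 | 0.687 |
| FLUX.2-VAE | 100K | 612M | 0.675 | 0.830 | 0.712 |
| FLUX.2-VAE | 200K | 612M | 0.625 | 0.841 | 0.713 |
| FLUX.2-VAE + REG | 100K | 616M | 0.687 | 0.830 | 0.722 |
| E2E-VAVAE | 100K | 611M | 0.632 | 0.824 | 0.703 |
| E2E-VAVAE | 200K | 611M | 0.679 | 0.836 | 0.716 |
| Pixel-space | | | | | |
| JiT | 100K | 615M | 0.516 | 0.782 | 0.674 |
| PixNerd | 100K | 615M | 0.484 | 0.777 | 0.643 |
| PixelGen | 100K | 615M | 0.554 | 0.798 | 0.678 |
| One-/Few-step | | | | | |
| MeanFlow (NFE=1) | 100K | 613M | 0.287 | 0.688 | 0.582 |
| MeanFlow (NFE=2) | 100K | 613M | 0.341 | 0.721 | 0.602 |

Systematic comparison on text-to-image generation.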

ImageNet vs. text-to-image.

Correlation between ImageNet FID and three T2I metrics, shown without and with CFG.

We have several observations. First, at the state-of-the-art frontier, ImageNet ranking does not robustly predict T2I ranking. For example, RAE with SpatialPE-L has a very good ImageNet FID, but its T2I performance is among the worst across the three metrics. Second, the T2I metrics disagree with each other to some extent. For example, E2E-Qwen-Image-VAE is one of the strongest methods under GenEval and DPG-Bench, but falls into the second tier under GenAIBench. Third, the ImageNet trend is consistent with the T2I trend at the level of broad method categories: improved latent-space methods (RAE, FLUX.2-VAE, REPA-E) > traditional latent-space > pixel-space > MeanFlow. ImageNet signals are therefore useful at the category level, but most state-of-the-art methods report FID between 1 and 2, which is exactly the region where the correlation is weakest. Fourth, when we train T2I models for 200K steps, performance under the three metrics generally stays similar or improves only slightly. This is notable because, in the qualitative samples below, images at 200K iterations look better than those at 100K. We suspect that the current metrics miss these improvements and that better metrics are needed.
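The first and third observations can be illustrated by recomputing a correlation on a subset of the reported numbers. The sketch below is not the authors' analysis: it uses an arbitrary twelve-method subset, pairs each method's ImageNet FID (the first value of each FID pair, read here as the with-CFG setting) with its 100K-iteration GenEval score, and uses a plain Pearson r. `pearson_r`, `correlation`, and the FID < 2 cutoff for the "frontier" are choices made for this example.

```python
from math import sqrt

# (method, ImageNet FID with CFG, GenEval at 100K iterations), copied from the tables.
POINTS = [
    ("DINOv2-B", 1.96, 0.628),
    ("DINOv3-B", 1.74, 0.636),
    ("SpatialPE-L", 1.86, 0.535),
    ("FLUX.2-VAE", 1.37, 0.675),
    ("E2E-VAVAE", 1.65, 0.632),
    ("SD-VAE-EMA", 2.43, 0.578),
    ("FLUX.1-VAE", 3.55, 0.559),
    ("JiT", 4.08, 0.516),
    ("PixNerd", 4.17, 0.484),
    ("PixelGen", 3.97, 0.554),
    ("MeanFlow (NFE=1)", 6.60, 0.287),
    ("MeanFlow (NFE=2)", 5.40, 0.341),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation(points):
    """Pearson r between FID and GenEval for a list of (name, fid, geneval)."""
    return pearson_r([p[1] for p in points], [p[2] for p in points])


# Assumed cutoff for the state-of-the-art cluster: FID below 2.
frontier = [p for p in POINTS if p[1] < 2.0]

r_all = correlation(POINTS)
r_frontier = correlation(frontier)
print(f"all {len(POINTS)} methods: r = {r_all:+.2f}")
print(f"frontier ({len(frontier)} methods, FID < 2): r = {r_frontier:+.2f}")
```

Because lower FID and higher GenEval are both better, a strongly negative r means the two benchmarks agree. On this subset the agreement is strong across all categories and noticeably weaker inside the FID < 2 cluster, though five points are far too few to draw firm conclusions.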

Text-to-image qualitative samples at 256×256. Curated qualitative samples from NanoGen latent-space methods (FLUX.1-VAE, FLUX.2-VAE, E2E-VAVAE, and LangPE-L) trained for 100K and 200K iterations at batch size 1024.

As shown in the figure below, training T2I remains efficient across all methods. Moreover, training cost is comparable across latent-space methods, while pixel-space methods such as JiT, PixNerd, and PixelGen are much cheaper to train on ImageNet because they do not compute latents from VAEs. RAE methods are marginally faster to train than VAE methods. MeanFlow is much slower than other T2I methods.

Wall-clock training time per 100K steps of the ImageNet and T2I setups across 25 DiT methods.

Recommended usage. Our recommendation is that future DiT papers report DiffusionBench, which includes both ImageNet and T2I generation, rather than any single axis. Methods that improve DiffusionBench are more likely to reflect broadly useful progress; methods that improve one axis but regress another may still be valuable, but should be labelled as task-specific improvements rather than general DiT advances.


Conclusion

Diffusion transformer research has matured to the point where single-benchmark evaluation is no longer enough. In this paper we introduce NanoGen, a training and evaluation framework that removes the engineering barrier to training and evaluating DiT methods on the T2I task, and use it to show that ImageNet rankings do not reliably predict text-to-image performance. Finally, we package the two evaluation axes, ImageNet and T2I generation, into DiffusionBench and argue for its adoption as the default DiT benchmark. Our hope is that making holistic evaluation cheap, both in engineering effort and in compute, will shift the field toward progress that is broad rather than local.

Team

Xingjian Leng, Jaskirat Singh, Zhanhao Liang, Ethan Smith, Martin Bell, Aninda Saha, Yuhui Yuan, Liang Zheng.

BibTeX

@misc{diffusionbench2025,
  title={{DiffusionBench: On Holistic Evaluation of Diffusion Transformers}},
  author={Leng, Xingjian and Singh, Jaskirat and Liang, Zhanhao and Smith, Ethan and Bell, Martin and Saha, Aninda and Yuan, Yuhui and Zheng, Liang},
  howpublished={\url{https://end2end-diffusion.github.io/diffusion-bench/}},
  year={2025}
}