Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

Stability AI¹, LMU Munich², Birchlabs³, Independent Researchers⁴
ICML 2024


Samples generated directly in RGB pixel space using our HDiT models trained on FFHQ-1024² and ImageNet-256².

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image-generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g. 1024²) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders, or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet-256², and sets a new state-of-the-art for diffusion models on FFHQ-1024².

Efficiency


Scaling of computational cost with target resolution for our HDiT-B/4 model vs. DiT-B/4 (Peebles & Xie, 2023), both in pixel space. At megapixel resolutions, our model incurs less than 1% of the computational cost of the standard diffusion transformer DiT at a comparable size.
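
The linear scaling follows from replacing global self-attention with neighborhood attention at the high-resolution levels. The following back-of-the-envelope sketch is our own illustration, not taken from the paper; the patch size p = 4 matches the figure, while the window size k = 7 is an assumption, and real FLOP counts also include projections and MLPs (which scale linearly anyway):

# Illustrative per-block attention cost comparison (assumptions noted above).

def tokens(res, p=4):
    return (res // p) ** 2  # number of patch tokens for a res x res image

def global_attention_cost(res):
    n = tokens(res)
    return n * n            # every token attends to every token: O(n^2)

def neighborhood_attention_cost(res, k=7):
    n = tokens(res)
    return n * k * k        # each token attends to a k x k window: O(n)

for res in (256, 512, 1024):
    ratio = neighborhood_attention_cost(res) / global_attention_cost(res)
    print(f"{res}x{res}: neighborhood/global attention cost ratio = {ratio:.5f}")

At 1024², the per-block ratio under these assumptions is roughly 7.5e-4, consistent with the sub-1% total cost shown in the plot.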

High-level Architecture Overview


High-level overview of our HDiT architecture, specifically the version for ImageNet at an input resolution of 256² with patch size p = 4, which has three levels. For each doubling of the target resolution, another neighborhood attention block is added. "lerp" denotes a linear interpolation with a learnable interpolation weight. All HDiT blocks receive the noise level and the conditioning (embedded jointly using a mapping network) as additional inputs.
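
To make the skip merging concrete: unlike U-Nets, which typically concatenate skip features with upsampled features, the hourglass merges the two branches via a learnable linear interpolation. Below is a minimal PyTorch sketch of such a merge module; it is our own simplification, and the initialization value is an assumption rather than the exact choice in the k-diffusion implementation:

import torch
import torch.nn as nn

class LerpSkip(nn.Module):
    """Merge a skip branch with upsampled deep features via
    out = lerp(skip, upsampled, w), where w is a learnable scalar."""

    def __init__(self):
        super().__init__()
        # Assumed initialization: start close to the skip branch so that,
        # early in training, this level behaves nearly as an identity map.
        self.weight = nn.Parameter(torch.tensor(0.01))

    def forward(self, skip, upsampled):
        return torch.lerp(skip, upsampled, self.weight)

In the full model, one such merge sits at every decoder level of the hourglass, while the jointly embedded noise level and class conditioning from the mapping network are fed to every HDiT block.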

Files

We provide the 50k generated samples used for FID computation for our 557M ImageNet model without CFG (part 1, 2, 3, 4, 5, 6, 7, 8), with CFG = 1.3 (part 1, 2, 3, 4, 5, 6, 7, 8), and for our FFHQ-1024² model without CFG (part 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21).
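
For reference, one common way to compute an FID score from such a sample dump is the clean-fid package. This is a hypothetical usage example, not the paper's exact evaluation pipeline, and the directory names are placeholders:

# pip install clean-fid
from cleanfid import fid

# Extract the downloaded parts into one directory of images, then compare
# against a directory of reference images at the matching resolution.
score = fid.compute_fid("hdit_imagenet256_samples", "imagenet256_reference")
print(f"FID: {score:.2f}")

FID is sensitive to resizing details and reference statistics, so scores computed this way may differ slightly from the numbers reported in the paper.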

BibTeX

@InProceedings{crowson2024hourglass,
  title     = {Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers},
  author    = {Crowson, Katherine and Baumann, Stefan Andreas and Birch, Alex and Abraham, Tanishq Mathew and Kaplan, Daniel Z and Shippole, Enrico},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {9550--9575},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/crowson24a/crowson24a.pdf},
  url       = {https://proceedings.mlr.press/v235/crowson24a.html},
  abstract  = {We present the Hourglass Diffusion Transformer (HDiT), an image-generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$. Code is available at https://github.com/crowsonkb/k-diffusion.}
}