Inf-DiT

Official implementation of Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

🆕 News

2024.05.20: The code and model weights are released.
2024.05.08: This repo is released.

⏳ TODO

Code release
Model weight release
Complete the explanation of the inference code and hyperparameters
Demo

🔆 Abstract

Diffusion models have shown remarkable performance in image generation in recent years. However, because memory grows quadratically when generating ultra-high-resolution images (e.g., 4096 × 4096), the resolution of generated images is often limited to 1024 × 1024. In this work, we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during inference and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves excellent performance in generating ultra-high-resolution images. Compared to commonly used UNet structures, our model saves more than 5× memory when generating 4096 × 4096 images.
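As a rough illustration of the scaling described above (not a measurement from the paper), the sketch below counts patch tokens for square images of increasing side length and compares that with the token count of a single fixed-size block. The patch size of 16 and block size of 128 are hypothetical values chosen only for the arithmetic; they are not taken from the Inf-DiT configuration, and real memory use also depends on cached context from neighbouring blocks.

```python
# Rough scaling illustration; PATCH and BLOCK are hypothetical values,
# not settings taken from the Inf-DiT code or paper.
PATCH = 16    # assumed pixels per token side
BLOCK = 128   # assumed block side length in pixels


def num_tokens(height: int, width: int, patch: int = PATCH) -> int:
    """Number of patch tokens needed to cover a height x width image."""
    return (height // patch) * (width // patch)


def tokens_per_block(block: int = BLOCK, patch: int = PATCH) -> int:
    """Tokens inside one square block; independent of the image size."""
    return num_tokens(block, block, patch)


if __name__ == "__main__":
    for side in (1024, 2048, 4096):
        whole = num_tokens(side, side)
        print(f"{side}x{side}: {whole} tokens for the whole image, "
              f"{tokens_per_block()} tokens per block")
```

Under these assumptions, quadrupling the side length multiplies the whole-image token count by 16, while the per-block count stays constant. That is the intuition behind generating the image block by block.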

📚 Model Inference

Model weights can be downloaded from here

Download the model weights and put them in the 'ckpt' directory, then run:
bash generate_sr_big_cli.sh
You can change "inference_type" (line 27 in generate_sr_big_cli.sh) to "ar" (parallel size = 1), "ar2" (parallel size = block_batch, set on line 28) or "full" (generate the entire image in one forward pass).
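To make the three modes concrete, here is a minimal sketch of how blocks might be grouped per forward pass. The function `block_batches`, the raster (row-by-row) ordering, and the way "ar2" groups consecutive blocks are all assumptions made for illustration; the scheduling in generate_t2i_sr.py may differ.

```python
from typing import Iterator, List, Tuple

Block = Tuple[int, int]  # (row, column) index of a block


def block_batches(rows: int, cols: int, inference_type: str,
                  block_batch: int = 1) -> Iterator[List[Block]]:
    """Yield groups of blocks covered by one forward pass (illustrative only)."""
    if block_batch < 1:
        raise ValueError("block_batch must be at least 1")
    # Assumed raster order: left to right, then top to bottom.
    order = [(r, c) for r in range(rows) for c in range(cols)]
    if inference_type == "full":
        size = max(1, len(order))   # entire image in one forward pass
    elif inference_type == "ar":
        size = 1                    # parallel size = 1
    elif inference_type == "ar2":
        size = block_batch          # parallel size = block_batch
    else:
        raise ValueError(f"unknown inference_type: {inference_type!r}")
    for start in range(0, len(order), size):
        yield order[start:start + size]


if __name__ == "__main__":
    for mode in ("ar", "ar2", "full"):
        batches = list(block_batches(2, 2, mode, block_batch=2))
        print(mode, len(batches), "forward pass group(s)")
```

In this sketch, a larger parallel size means fewer, larger groups, which would trade memory for speed; "full" is the extreme case with a single group.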

🆚 Ultra-high-resolution generation Demo vs other methods

(click to see the details)

vs DemoFusion

Caption: A digital painting of a young goddess with flower and fruit adornments evoking symbolic metaphors.

Resolution: $2048\times 2048$

vs BSRGAN

Caption: The image depicts a concept art of Schrodinger's cat in a box with an abstract background of waves and particles in a dynamic composition.

Resolution: $2048\times 2048$

vs Patch-Super-Resolution(4096*4096)

Caption: A portrait of a character in a scenic environment.

Resolution: $4096\times 4096$

👀 Super-Resolution results

(click to see the details)

Resolution: $1920\times 1080$

Resolution: $1920\times 768$

⚙️ Setup

📖 Citation

Please cite us if our work is useful for your research.

@misc{yang2024infdit,
      title={Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer}, 
      author={Zhuoyi Yang and Heyang Jiang and Wenyi Hong and Jiayan Teng and Wendi Zheng and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2024},
      eprint={2405.04312},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📭 Contact

If you have any comments or questions, feel free to contact zhuoyiyang2000@gmail.com or jianghy0581@gmail.com.
