[CVPRW 2023🎈] Best Collection.md


“The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion.” -- William James

🍓 Content 🍓

1. Introduction «🎯Back To Top»

The human perceptual system is a complex and multifaceted construct. The five basic senses of hearing, touch, taste, smell, and vision serve as the primary channels of perception, allowing us to perceive and interpret most of the external stimuli encountered in this “blooming, buzzing confusion” of a world. These stimuli typically come from multiple events that are spatially and temporally distributed.

In other words, we constantly perceive the world in a “multimodal” manner: we combine different information channels to distinguish features within the confusion, seamlessly integrate sensations from multiple modalities, and build knowledge through experience.

2. Background «🎯Back To Top»

Table 1. Chronological timeline of representative text-to-image datasets.

“Public” includes a link to each dataset (if available✔) or paper (if not❌).
“Annotations” denotes the number of text descriptions per image.
“Attrs” denotes the total number of attributes in each dataset.

| Year | Dataset | Public | Category | Images (Resolution) | Annotations | Attrs | Other Information |
|------|---------|--------|----------|---------------------|-------------|-------|-------------------|
| 2008 | Oxford-102 Flowers | | Flower | 8,189 (-) | 10 | - | - |
| 2011 | CUB-200-2011 | | Bird | 11,788 (-) | 10 | - | BBox, Segmentation... |
| 2014 | MS-COCO2014 | | Iconic Objects | 120k (-) | 5 | - | BBox, Segmentation... |
| 2018 | Face2Text | | Face | 10,177 (-) | ~1 | - | - |
| 2019 | SCU-Text2face | | Face | 1,000 (256×256) | 5 | - | - |
| 2020 | Multi-ModalCelebA-HQ | | Face | 30,000 (512×512) | 10 | 38 | Masks, Sketches |
| 2021 | FFHQ-Text | | Face | 760 (1024×1024) | 9 | 162 | BBox |
| 2021 | M2C-Fashion | | Clothing | 10,855,753 (256×256) | 1 | - | - |
| 2021 | CelebA-Dialog | | Face | 202,599 (178×218) | ~5 | 5 | Identity Label... |
| 2021 | Faces a la Carte | | Face | 202,599 (178×218) | ~10 | 40 | - |
| 2021 | LAION-400M | | Random Crawled | 400M (-) | 1 | - | KNN Index... |
| 2022 | Bento800 | | Food | 800 (600×600) | 9 | - | BBox, Segmentation, Label... |
| 2022 | LAION-5B | | Random Crawled | 5.85B (-) | 1 | - | URL, Similarity, Language... |
| 2022 | DiffusionDB | | Synthetic Images | 14M (-) | 1 | - | Size, Random Seed... |
| 2022 | COYO-700M | | Random Crawled | 747M (-) | 1 | - | URL, Aesthetic Score... |
| 2022 | DeepFashion-MultiModal | | Full Body | 44,096 (750×1101) | 1 | - | Densepose, Keypoints... |
| 2023 | ANNA | | News | 29,625 (256×256) | 1 | - | - |
| 2023 | DreamBooth | | Objects & Pets | 158 (-) | 25 | - | - |

2.2 Evaluation Metrics «🎯Back To Top»

Automatic Evaluation

👆🏻: Higher is better. 👇🏻: Lower is better.

Human Evaluation

Participants are asked to rate generated images based on two criteria: plausibility (including object accuracy, counting, positional alignment, or image-text alignment) and naturalness (whether the image appears natural or realistic).
The evaluation protocol follows a 5-point Likert scale, on which human evaluators rate each prompt from 1 to 5, with 5 representing the best and 1 the worst.

For rare object combinations that require common-sense understanding, or when aiming to avoid bias related to race or gender, human evaluation is even more important.
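The protocol above can be sketched as a small aggregation step; this is a minimal illustration with hypothetical rater data and a hypothetical function name, not a fixed implementation from any of the surveyed papers:

```python
from statistics import mean

def aggregate_likert(ratings):
    """Average 1-5 Likert ratings per criterion across human evaluators.

    `ratings` maps each criterion (e.g. "plausibility", "naturalness")
    to a list of integer scores, one per evaluator.
    """
    for criterion, scores in ratings.items():
        if any(s < 1 or s > 5 for s in scores):
            raise ValueError(f"Likert scores must lie in 1..5 for {criterion}")
    return {criterion: mean(scores) for criterion, scores in ratings.items()}

# Hypothetical ratings for one generated image from three evaluators.
ratings = {
    "plausibility": [4, 5, 3],  # object accuracy, counting, image-text alignment
    "naturalness": [5, 4, 4],   # does the image look natural or realistic?
}
print(aggregate_likert(ratings))
```

Averaging per criterion (rather than collapsing both into one score) keeps plausibility and naturalness separately comparable across models, which mirrors how the two criteria are rated independently.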

3. Generative Models «🎯Back To Top»

A comprehensive list of text-to-image approaches.

The pioneering works in each development stage are highlighted. Text-to-face generation works are marked with a 👸 emoji.

3.1 GAN Model «🎯Back To Top»

[Conditional GAN-based]

  • 2016~2021:
    • Generative Adversarial Text to Image Synthesis [Paper] [Code]
    • Learning What and Where to Draw [Paper] [Code]
    • Adversarial nets with perceptual losses for text-to-image synthesis [Paper]
    • I2T2I: Learning Text to Image Synthesis with Textual Data Augmentation [Paper] [Code]
    • Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis [Paper]
    • MC-GAN: Multi-conditional Generative Adversarial Network for Image Synthesis [Paper] [Code]
    • Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction [Paper] [Code]

[StackGAN-based]

  • 2017:
    • StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks [Paper] [Code]
  • 2018:
    • StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks [Paper] [Code]
    • Text-to-image-to-text translation using cycle consistent adversarial networks [Paper] [Code]
    • AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks [Paper] [Code]
    • ChatPainter: Improving Text to Image Generation using Dialogue [Paper]
  • 2019:
    • 👸 FTGAN: A Fully-trained Generative Adversarial Networks for Text to Face Generation [Paper]
    • C4Synth: Cross-Caption Cycle-Consistent Text-to-Image Synthesis [Paper]
    • Semantics-Enhanced Adversarial Nets for Text-to-Image Synthesis [Paper]
    • Semantics Disentangling for Text-to-Image Generation [Paper] [Website]
    • MirrorGAN: Learning Text-to-image Generation by Redescription [Paper] [Code]
    • Controllable Text-to-Image Generation [Paper] [Code]
    • DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis [Paper] [Code]
  • 2020:
    • CookGAN: Causality based Text-to-Image Synthesis [Paper]
    • RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge [Paper]
    • KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis [Paper]
    • CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis [Paper] [Code]
    • End-to-End Text-to-Image Synthesis with Spatial Constrains [Paper]
    • Semantic Object Accuracy for Generative Text-to-Image Synthesis [Paper] [Code]
  • 2021:
    • 👸 Multi-caption Text-to-Face Synthesis: Dataset and Algorithm [Paper] [Code]
    • 👸 Generative Adversarial Network for Text-to-Face Synthesis and Manipulation [Paper]
    • 👸 Generative Adversarial Network for Text-to-Face Synthesis and Manipulation with Pretrained BERT Model [Paper]
    • Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis [Paper]
    • Unsupervised text-to-image synthesis [Paper]
    • RiFeGAN2: Rich Feature Generation for Text-to-Image Synthesis from Constrained Prior Knowledge [Paper]
  • 2022:
    • 👸 DualG-GAN, a Dual-channel Generator based Generative Adversarial Network for text-to-face synthesis [Paper]
    • 👸 CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis [Paper]
    • DR-GAN: Distribution Regularization for Text-to-Image Generation [Paper] [Code]
    • T-Person-GAN: Text-to-Person Image Generation with Identity-Consistency and Manifold Mix-Up [Paper] [Code]

[StyleGAN-based]

  • 2021:
    • 👸 TediGAN: Text-Guided Diverse Image Generation and Manipulation [Paper] [Extended Version][Code] [Dataset] [Colab] [Video]
    • 👸 Faces a la Carte: Text-to-Face Generation via Attribute Disentanglement [Paper]
    • Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [Paper]
  • 2022:
    • 👸 Text-Free Learning of a Natural Language Interface for Pretrained Face Generators [Paper] [Code]
    • 👸 clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP [Paper] [Code]
    • 👸 TextFace: Text-to-Style Mapping based Face Generation and Manipulation [Paper]
    • 👸 AnyFace: Free-style Text-to-Face Synthesis and Manipulation [Paper]
    • 👸 StyleT2F: Generating Human Faces from Textual Description Using StyleGAN2 [Paper] [Code]
    • 👸 StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [Paper] [Code]
    • LAFITE: Towards Language-Free Training for Text-to-Image Generation [Paper] [Code]

[Others]

  • 2018:
    • (Hierarchical adversarial network) Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network [Paper] [Code]
  • 2021:
    • (BigGAN) CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders [Paper] [Code]
    • (BigGAN) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization [Paper] [Code]
  • 2022:
    • (One-stage framework) Text to Image Generation with Semantic-Spatial Aware GAN [Paper] [Code]

3.2 Autoregressive Model «🎯Back To Top»

[Transformer-based]

  • 2021:
  • 2022:
    • CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers [Paper] [Code]
    • Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [Paper] [Code] [Project]
    • Neural Architecture Search with a Lightweight Transformer for Text-to-Image Synthesis [Paper]
    • DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers [Paper] [Code]
    • CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [Paper] [Code]
    • Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer [Paper]
    • Autoregressive Image Generation using Residual Quantization [Paper] [Code]
    • Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [Paper] [Code] [The Little Red Boat Story]

3.3 Diffusion Model «🎯Back To Top»

[Diffusion-based]

  • 2022:
    • High-Resolution Image Synthesis with Latent Diffusion Models [Paper] [Code] [Stable Diffusion Code]
    • Vector Quantized Diffusion Model for Text-to-Image Synthesis [Paper] [Code]
    • Hierarchical Text-Conditional Image Generation with CLIP Latents [Paper] [Blog] [Risks and Limitations] [Unofficial Code]
    • Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Paper] [Blog]
    • GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [Paper] [Code]
    • Compositional Visual Generation with Composable Diffusion Models [Paper] [Code] [Project] [Hugging Face]
    • Prompt-to-Prompt Image Editing with Cross Attention Control [Paper] [Code] [Unofficial Code] [Project]
    • Creative Painting with Latent Diffusion Models [Paper]
    • DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics [Paper] [Project]
    • Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation [Paper]
    • ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts [Paper]
    • eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [Paper] [Project] [Video]
    • Multi-Concept Customization of Text-to-Image Diffusion [Paper] [Project] [Code] [Hugging Face]
  • 2023:

4. Generative Applications «🎯Back To Top»

4.1 Text-to-Image «🎯Back To Top»


Figure 1. Diverse text-to-face results generated from GAN-based / Diffusion-based / Transformer-based models.

Images in orange boxes are captured from the original papers (a) [zhou2021generative], (b) [pinkney2022clip2latent], and (c) [li2022stylet2i]; the others are generated from textual descriptions using a pre-trained model [pinkney2022clip2latent] ((b), bottom-left row) and the Dreamstudio ((a)-(c), middle row) and DALL-E 2 ((a)-(c), right row) online platforms.

Please refer to Section 3 (Generative Models) for more details about text-to-image.

4.2 Text-to-X «🎯Back To Top»


Figure 2. Selected representative samples on Text-to-X.

Images are captured from the original papers ((a) [ho2022imagen], (b)-Left [xu2022dream3d], (b)-Right [poole2022dreamfusion], (c) [tevet2022human]) and remade.

4.3 X-to-Image «🎯Back To Top»


Figure 3. Selected representative samples on X-to-Image.

Images are captured from the original papers and remade.

(a) Layered Editing [bar2022text2live] (Left), Recontextualization [ruiz2023dreambooth] (Middle), Image Editing [brooks2022instructpix2pix] (Right).
(b) Context-Aware Generation [he2021context] (Left), Model Complex Scenes [yang2022modeling] (Right).
(c) Face Reconstruction [dado2022hyperrealistic] (Left), High-resolution Image Reconstruction [takagi2022high] (Right).
(d) Speech to Image [wang2021generating] (Left), Sound Guided Image Manipulation [lee2022robust] (Middle), Robotic Painting [misra2023robot] (Right).
Legend: X excluding “Additional Input Image” (Blue dotted line box, top row). Additional Input Image (Green box, middle row). Ground Truth (Red box, middle row). Generated / Edited / Reconstructed Image (Black box, bottom row).

4.4 Multi Tasks «🎯Back To Top»

5. Discussion «🎯Back To Top»