⏳ Recently Focused Papers (FYI) ⏳

  • ⭐(arXiv preprint 2023) Text-To-4D Dynamic Scene Generation, Uriel Singer et al. [Paper] [Project]
    • 🍬 MAV3D: the first method to generate 3D dynamic scenes from a text description. MAV3D does not require any 3D or 4D data; the underlying T2V model is trained only on text-image pairs and unlabeled videos.
  • ⭐⭐(arXiv preprint 2023) Muse: Text-To-Image Generation via Masked Generative Transformers, Huiwen Chang et al. [Paper] [Project]
    • 🍬 Muse: a state-of-the-art text-to-image generation model that achieves excellent FID and CLIP scores, is significantly faster than comparable models, and enables out-of-the-box, zero-shot editing capabilities including inpainting, outpainting, and mask-free editing.
  • ⭐(arXiv preprint 2022) ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, Zhida Feng et al. [Paper]
    • 🍬 ERNIE-ViLG 2.0: a large-scale Chinese text-to-image diffusion model that progressively upgrades the quality of generated images by (1) incorporating fine-grained textual and visual knowledge of key elements in the scene, and (2) utilizing different denoising experts at different denoising stages. ERNIE-ViLG 2.0 achieves a state-of-the-art zero-shot FID score of 6.75 on MS-COCO.
  • ⭐⭐(arXiv preprint 2022) Prompt-to-Prompt Image Editing with Cross Attention Control, Amir Hertz et al. [Paper] [Code] [Unofficial Code] [Project]
    • 🍬 Prompt-to-Prompt editing: control the attention maps of the edited image by injecting the attention maps of the original image along the diffusion process. This makes it possible to steer synthesis by editing the textual prompt only, paving the way for a myriad of caption-based editing applications (a minimal attention-injection sketch follows after this list).
  • ⭐⭐(arXiv preprint 2022) Imagen Video: High Definition Video Generation with Diffusion Models, Jonathan Ho et al. [Paper] [Project]
    • 🍬 Imagen Video: given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. Imagen Video is not only capable of generating high-fidelity videos, but also has a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding.
  • ⭐⭐(arXiv preprint 2022) Make-A-Video: Text-to-Video Generation without Text-Video Data, Uriel Singer et al. [Paper] [Project] [Short read] [Code]
    • 🍬 Make-A-Video (Meta AI) generates videos from text. It is not only able to generate videos, but is also the new state-of-the-art method, producing higher-quality and more coherent videos than prior work.
  • ⭐⭐(arXiv preprint 2022) DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Nataniel Ruiz et al. [Paper] [Project]
    • 🍬 DreamBooth: given just a few images of a subject as input, fine-tune a pretrained text-to-image model (Imagen) so that it learns to bind a unique identifier to that specific subject, enabling synthesis of the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images.
    • 📚 Subject Recontextualization, Text-guided View Synthesis, Appearance Modification, Artistic Rendering (all while preserving the subject's key features)
  • ⭐(arXiv preprint 2022) An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, Rinon Gal et al. [Paper] [Code] [Project]
    • 🍬 Textual Inversion: using only 3-5 images of a user-provided concept (an object or a style), learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way (a minimal optimization sketch follows after this list).
  • ⭐⭐(ECCV 2022) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, Oran Gafni et al. [Paper] [Code] [(Story)The Little Red Boat Story] [(Story)New Adventures]
    • 🍬 Make-A-Scene: generates high-fidelity images at a resolution of 512x512 pixels and introduces several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation.
  • ⭐(arXiv preprint 2022) NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis, Chenfei Wu et al. [Paper] [Code] [Project]
    • 🍬 NUWA-Infinity: an infinite visual synthesis model for generating arbitrarily-sized high-resolution images and long-duration videos.
  • ⭐⭐(arXiv preprint 2022) Scaling Autoregressive Models for Content-Rich Text-to-Image Generation, Jiahui Yu et al. [Paper] [Code] [Project]
    • 🍬 Pathways Autoregressive Text-to-Image (Parti): generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge; treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language.
  • ⭐⭐(arXiv preprint 2022) Compositional Visual Generation with Composable Diffusion Models, Nan Liu et al. [Paper] [Code] [Project]
    • 🍬 Composable Diffusion: an alternative, structured approach for compositional generation with diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image (a minimal noise-composition sketch follows after this list).
  • ⭐⭐(arXiv preprint 2022) [Imagen] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Chitwan Saharia et al. [Paper] [Blog]
    • 🍬 Imagen: A text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding, which builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation.
  • ⭐(OpenAI) [DALL-E 2] Hierarchical Text-Conditional Image Generation with CLIP Latents, Aditya Ramesh et al. [Paper] [Blog] [Risks and Limitations] [Unofficial Code]
    • 🍬 DALL-E 2: a two-stage model consisting of a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on that image embedding (a minimal pipeline sketch follows after this list).
  • ⭐(arXiv preprint 2022) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, Wenyi Hong et al. [Paper] [Code]
    • 🍬 CogVideo: the first open-source large-scale pretrained text-to-video model, trained by inheriting a pretrained text-to-image model (CogView2); it outperforms all publicly available models by a large margin in machine and human evaluations.
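
The sketch below illustrates the attention-injection idea behind Prompt-to-Prompt. It is a minimal PyTorch sketch, not the authors' released code: the function name, tensor shapes, and single-layer view are assumptions; in the real method the maps are recorded and injected across layers and diffusion timesteps.

```python
import torch

def cross_attention_with_injection(q, k, v, injected_attn=None):
    """Text-to-image cross-attention with optional map injection (Prompt-to-Prompt idea).

    q: (batch, num_pixels, dim) image queries
    k, v: (batch, num_tokens, dim) text keys / values
    injected_attn: (batch, num_pixels, num_tokens) attention map recorded while
        generating with the original prompt, or None for ordinary attention.
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)
    if injected_attn is not None:
        # Reuse the original prompt's attention map so the edited image keeps the
        # original layout while the edited prompt's values supply the new content.
        attn = injected_attn
    return attn @ v, attn
```

In an actual edit, one first generates with the original prompt while recording the attention maps, then re-runs the diffusion process with the edited prompt and injects the recorded maps for a chosen fraction of the timesteps.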
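For Textual Inversion, the sketch below shows the core optimization loop: a single new token embedding is learned while the text encoder and diffusion model stay frozen. `denoising_loss_fn`, the embedding width of 768, and the hyperparameters are hypothetical placeholders, not the paper's exact settings.

```python
import torch

def learn_concept_embedding(denoising_loss_fn, num_steps=3000, embedding_dim=768, lr=5e-3):
    """Learn one new token embedding v* for a user-provided concept (Textual Inversion idea).

    denoising_loss_fn(v_star) is assumed to sample a batch of the user's 3-5 concept
    images, build prompts containing the placeholder token, splice v_star into the
    frozen text encoder's output, and return the usual diffusion denoising loss.
    """
    v_star = torch.randn(embedding_dim, requires_grad=True)
    optimizer = torch.optim.AdamW([v_star], lr=lr)
    for _ in range(num_steps):
        loss = denoising_loss_fn(v_star)  # only v_star receives gradients
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return v_star.detach()
```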
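For Composable Diffusion, the sketch below composes several concepts with a logical AND by summing classifier-free-guidance deltas at each sampling step. `eps_model` and the guidance weight are assumed stand-ins for an epsilon-predicting U-Net and its scale.

```python
def composed_noise_prediction(eps_model, x_t, t, concept_embeddings, null_embedding, weight=7.5):
    """Compose several concepts (logical AND) in the spirit of Composable Diffusion Models.

    eps_model(x_t, t, cond) is a hypothetical epsilon-predicting U-Net; each entry of
    concept_embeddings conditions on one component of the image, and null_embedding
    is the unconditional input.
    """
    eps_uncond = eps_model(x_t, t, null_embedding)
    eps = eps_uncond
    for cond in concept_embeddings:
        # Each concept contributes its own guidance delta relative to the unconditional prediction.
        eps = eps + weight * (eps_model(x_t, t, cond) - eps_uncond)
    return eps  # used in place of the single-prompt prediction at every sampling step
```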
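Finally, the DALL-E 2 pipeline sketch below simply wires the two stages together; all four arguments are hypothetical callables rather than OpenAI's API or released models.

```python
def dalle2_style_generate(caption, clip_text_encoder, prior, decoder):
    """Two-stage text-to-image generation in the spirit of DALL-E 2 (unCLIP).

    clip_text_encoder maps a caption to a CLIP text embedding, prior maps that text
    embedding to a predicted CLIP image embedding, and decoder is a diffusion model
    that renders pixels conditioned on the image embedding (and optionally the caption).
    """
    text_emb = clip_text_encoder(caption)   # CLIP text embedding of the caption
    image_emb = prior(text_emb)             # stage 1: prior predicts a CLIP image embedding
    return decoder(image_emb, caption)      # stage 2: decoder generates the image
```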