HICAI-ZJU/Scientific-LLM-Survey

Scientific Large Language Models (Sci-LLMs)

This repository collects papers on scientific large language models, particularly in the domains of biology and chemistry.

😎 You are welcome to recommend missing papers by opening an Issue or a Pull Request.

🔔 News

In this survey, we focus on scientific languages (i.e., textual, molecular, protein, and genomic languages), as well as their combination (i.e., multimodal language).

🌟 Contents

📖 Textual Scientific Large Language Models (Text-Sci-LLMs)

Biology

  • 2019.05 BioBERT: a pre-trained biomedical language representation model for biomedical text mining, arXiv, Code
  • 2019.07 Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets, arXiv, Code
  • 2020.10 BioMegatron: Larger Biomedical Domain Language Model, arXiv, Code
  • 2020.10 Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, arXiv, Hugging Face
  • 2021.06 BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA, ACL Anthology, Code
  • 2022.03 LinkBERT: Pretraining Language Models with Document Links, arXiv, Code
  • 2023.03 BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, arXiv, Code
  • 2023.08 BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine, arXiv, Code
  • 2023.09 BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials, arXiv

Chemistry

Comprehensive

  • 2019.09 SciBERT: A Pretrained Language Model for Scientific Text, arXiv, Code
  • 2023.05 The Diminishing Returns of Masked Language Models to Science, arXiv, Hugging Face
  • 2023.08 DARWIN Series: Domain Specific Large Language Models for Natural Science, arXiv, Code
  • 2024.01 SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning, arXiv, GitHub

Datasets and Benchmarks

  • MMLU, 2020.09. Measuring Massive Multitask Language Understanding, arXiv
  • C-Eval, 2023.05. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, arXiv
  • AGIEval, 2023.05. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, arXiv
  • ScienceQA, 2022.09. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, arXiv
  • Xiezhi, 2023.06. Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation, arXiv
  • SciEval, 2023.08. SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research, arXiv
  • Bioinfo-Bench, 2023.10. A Simple Benchmark Framework for LLM Bioinformatics Skills Evaluation, bioRxiv
  • BLURB, 2020.07. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, arXiv
  • ARC, 2018.03. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, arXiv
  • SciQ, 2017.07. Crowdsourcing Multiple Choice Science Questions, arXiv

🧪 Molecular Large Language Models (Mol-LLMs)

Molecule Property Prediction

  • 2019.09 SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction, ACM-BCB, Code
  • 2019.11 SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery, arXiv, Code
  • 2020.02 Molecule attention transformer, arXiv, Code
  • 2020.10 ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction, arXiv, Code
  • 2020.10 Self-Supervised Graph Transformer on Large-Scale Molecular Data, arXiv, Code
  • 2020.11 Language models in molecular discovery, NeurIPS, Code
  • 2021.05 MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction, Briefings in Bioinformatics, Code
  • 2021.06 Algebraic graph-assisted bidirectional transformers for molecular property prediction, Nature Communications, Code
  • 2021.09 Mol-BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction, Wireless Communications and Mobile Computing, Code
  • 2021.10 Relative molecule self-attention transformer, Journal of Cheminformatics, Code
  • 2022.08 KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular Property Prediction, Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Code
  • 2022.09 ChemBERTa-2: Towards Chemical Foundation Models, arXiv, Code
  • 2022.01 Chemformer: a pre-trained transformer for computational chemistry, Mach. Learn.: Sci. Technol., Code
  • 2022.10 Large-Scale Distributed Training of Transformers for Chemical Fingerprinting, JCIM, Code
  • 2022.11 BARTSmiles: Generative Masked Language Models for Molecular Representations, arXiv, Code
  • 2022.12 Large-Scale Chemical Language Representations Capture Molecular Structure and Properties, arXiv, Code
  • 2022.12 Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration, Research, Code
  • 2023.01 MolRoPE-BERT: An enhanced molecular representation with Rotary Position Embedding for molecular property prediction, Journal of Molecular Graphics and Modelling
  • 2023.01 Molformer: Motif-based Transformer on 3D Heterogeneous Molecular Graphs, arXiv, Code
  • 2023.02 Uni-Mol: A Universal 3D Molecular Representation Learning Framework, NeurIPS, Code
  • 2023.05 SELFormer: Molecular Representation Learning via SELFIES Language Models, arXiv, Code
  • 2023.07 Molecular Descriptors Property Prediction Using Transformer-Based Approach, IJMS

Interaction Prediction

  • 2020.12 X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis, bioRxiv, Code

Molecule Generation/Design/Edit

  • 2021.05 MolGPT: Molecular Generation Using a Transformer-Decoder Model, JCIM, Code
  • 2021.07 Transmol: repurposing a language model for molecular generation, RSC Advances, Code
  • 2021.09 Generative Pre-Training from Molecules, ChemRxiv, Code
  • 2021.12 Generative Chemical Transformer: Neural Machine Learning of Molecular Geometric Structures from Chemical Language via Attention, JCIM, Code
  • 2022.10 A Pre-trained Conditional Transformer for Target-specific De Novo Molecular Generation, arXiv
  • 2023.05 iupacGPT: IUPAC-based large-scale molecular pre-trained model for property prediction and molecule generation, ChemRxiv, Code
  • 2023.05 cMolGPT: A Conditional Generative Pre-Trained Transformer for Target-Specific De Novo Molecular Generation, Molecules, Code
  • 2023.05 Molecule generation using transformers and policy gradient reinforcement learning, Scientific Reports, Code
  • 2023.10 Domain-Agnostic Molecular Generation with Self-feedback, arXiv, Code

Reaction Prediction

  • 2019.08 Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, ACS Cent. Sci., Code
  • 2019.08 Molecular Transformer unifies reaction prediction and retrosynthesis across pharma chemical space, Chemical Communications
  • 2019.09 A Transformer Model for Retrosynthesis, ICANN, Code
  • 2019.12 Predicting Retrosynthetic Reaction using Self-Corrected Transformer Neural Networks, arXiv, Code
  • 2020.11 State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis, Nature Communications, Code
  • 2021.01 Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables, JCIM, Code
  • 2021.01 Prediction of chemical reaction yields using deep learning, Mach. Learn.: Sci. Technol., Code
  • 2021.03 Predicting Chemical Reaction Outcomes: A Grammar Ontology-based Transformer Framework, AIChE Journal
  • 2021.10 Molecular Graph Enhanced Transformer for Retrosynthesis Prediction, Neurocomputing, Code
  • 2021.10 Permutation Invariant Graph-to-Sequence Model for Template-Free Retrosynthesis and Reaction Prediction, arXiv, Code
  • 2022.03 Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments, Nature Communications, Code
  • 2023.02 Enhancing diversity in language based models for single-step retrosynthesis, Digital Discovery, Code
  • 2023.07 Unbiasing Retrosynthesis Language Models with Disconnection Prompts, ACS Cent. Sci., Code

Datasets and Benchmarks

🧬 Protein Large Language Models (Prot-LLMs)

Protein Sequence Representation

  • 2020.02 Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences, PNAS, Code
  • 2021.02 MSA transformer, PMLR, Code
  • 2021.02 Multi-scale representation learning on proteins, NeurIPS
  • 2021.02 Language models enable zero-shot prediction of the effects of mutations on protein function, NeurIPS, Code
  • 2021.07 ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, Code
  • 2021.07 Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model, CoRR
  • 2021.09 Toward more general embeddings for protein design: Harnessing joint representations of sequence and structure, bioRxiv
  • 2022.02 ProteinBERT: a universal deep-learning model of protein sequence and function, bioRxiv, Code
  • 2022.04 Lm-gvp: an extensible sequence and structure informed deep learning framework for protein property prediction, bioRxiv, Code
  • 2022.05 Retrieved Sequence Augmentation for Protein Representation Learning, bioRxiv, Code
  • 2022.06 OntoProtein: Protein Pretraining With Gene Ontology Embedding, arXiv, Code
  • 2022.07 Language models of protein sequences at the scale of evolution enable accurate structure prediction, bioRxiv, Code
  • 2023.02 Multi-level Protein Structure Pre-training via Prompt Learning, ICLR, Code
  • 2023.02 Protein Representation Learning via Knowledge Enhanced Primary Structure Modeling, arXiv, Code
  • 2023.10 Deciphering the protein landscape with ProtFlash, a lightweight language model, bioRxiv, Code
  • 2023.10 Enhancing protein language models with structure-based encoder and pre-training, arXiv, Code
  • 2023.10 Saprot: Protein language modeling with structure-aware vocabulary, bioRxiv, Code
  • 2023.12 ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers, bioRxiv

Protein Sequence Generation/Design

  • 2020.03 ProGen: Language Modeling for Protein Generation, arXiv, Code
  • 2021.01 A deep unsupervised language model for protein design, bioRxiv, Code
  • 2021.01 Fold2seq: A joint sequence (1d)-fold (3d) embedding-based generative model for protein design, PMLR, Code
  • 2022.01 ZymCTRL: a conditional language model for the controllable generation of artificial enzymes, NeurIPS, Code
  • 2022.04 Few Shot Protein Generation, arXiv
  • 2022.05 RITA: a Study on Scaling Up Generative Protein Sequence Models, arXiv
  • 2022.12 Generative language modeling for antibody design, arXiv, Code
  • 2023.02 Structure-informed Language Models Are Protein Designers, bioRxiv
  • 2023.02 Generative power of a protein language model trained on multiple sequence alignments, eLife, Code
  • 2023.02 Protein sequence design in a latent space via model-based reinforcement learning, ICLR
  • 2023.06 Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation, arXiv, Code
  • 2023.07 ProstT5: Bilingual Language Model for Protein Sequence and Structure, bioRxiv, Code
  • 2023.07 xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein, bioRxiv
  • 2023.08 Efficient and accurate sequence generation with small-scale protein language models, bioRxiv
  • 2023.10 Generative Antibody Design for Complementary Chain Pairing Sequences through Encoder-Decoder Language Model, NeurIPS
  • 2023.10 ProGen2: exploring the boundaries of protein language models, Cell, Code
  • 2023.11 PoET: A generative model of protein families as sequences-of-sequences, arXiv

Datasets and Benchmarks

🦠 Genomic Large Language Models (Gene-LLMs)

General

  • 2021.02 DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome, Bioinformatics
  • 2022.08 MoDNA: motif-oriented pre-training for DNA language model, ACM-BCB
  • 2023.01 Species-aware DNA language modeling, bioRxiv
  • 2023.01 The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics, bioRxiv
  • 2023.06 HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution, arXiv
  • 2023.06 DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome, arXiv
  • 2023.06 GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences, bioRxiv
  • 2023.06 Geneformer: Transfer learning enables predictions in network biology, bioRxiv
  • 2023.07 EpiGePT: a Pretrained Transformer model for epigenomics, bioRxiv
  • 2023.08 Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision, bioRxiv
  • 2023.08 DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks, bioRxiv
  • 2024.02 Evo: Sequence modeling and design from molecular to genome scale with Evo, Nature

Function Prediction

  • 2021.10 Effective gene expression prediction from sequence by integrating long-range interactions, Nature Methods
  • 2022.08 iEnhancer-BERT: A Novel Transfer Learning Architecture Based on DNA-Language Model for Identifying Enhancers and Their Strength, ICIC 2022
  • 2022.10 iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations, Genome Biology
  • 2022.12 iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models, arXiv
  • 2023.03 miProBERT: identification of microRNA promoters based on the pre-trained model BERT, Briefings in Bioinformatics
  • 2023.07 PLPMpro: Enhancing promoter sequence prediction with prompt-learning based pre-trained language model, Computers in Biology and Medicine
  • 2024.02 FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics, arXiv

Variants and Evolution Prediction

  • 2022.08 DNA language models are powerful predictors of genome-wide variant effects, bioRxiv
  • 2022.10 GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics, bioRxiv
  • 2023.10 GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction, bioRxiv

DNA-Protein Interaction Prediction

RNA Prediction

  • 2023.02 Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction, bioRxiv
  • 2023.03 Multiple sequence-alignment-based RNA language model and its application to structural inference, bioRxiv
  • 2023.06 Prediction of Multiple Types of RNA Modifications via Biological Language Model, IEEE/ACM Transactions on Computational Biology and Bioinformatics
  • 2023.07 Uni-RNA: Universal Pre-trained Models Revolutionize RNA Research, bioRxiv

Datasets and Benchmarks

Ⓜ️ Multimodal Scientific Large Language Models (MM-Sci-LLMs)

Molecule&text

  • 2021.11 Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries, EMNLP, Code
  • 2022.02 KV-PLM: A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals, Nature Communications, Code
  • 2022.09 MoMu: A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language, arXiv, Code
  • 2022.11 MolT5: Translation between Molecules and Natural Language, arXiv, Code
  • 2023.05 Text+Chem T5: Unifying Molecular and Textual Representations via Multi-task Language Modelling, arXiv, Code
  • 2023.05 DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs, techRxiv, Code
  • 2023.06 GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning, bioRxiv, Code
  • 2023.06 MolReGPT: Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective, arXiv, Code
  • 2023.06 ChatMol: Interactive Molecular Discovery with Natural Language, arXiv, Code
  • 2023.07 MolXPT: Wrapping Molecules with Text for Generative Pre-training, ACL
  • 2023.07 MolFM: A Multimodal Molecular Foundation Model, arXiv, Code
  • 2023.08 GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text, arXiv
  • 2023.10 GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction, arXiv, Code
  • 2023.12 MoleculeSTM: Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing, arXiv, Code

Protein&text

  • 2022.04 ProTranslator: zero-shot protein function prediction using textual description, arXiv, Code
  • 2023.02 ProteinDT: A Text-guided Protein Design Framework, arXiv
  • 2023.07 ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts, arXiv, Code
  • 2023.07 Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers, arXiv
  • 2023.10 InstructProtein: Aligning Human and Protein Language via Knowledge Instruction, arXiv
  • 2024.02 ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing, arXiv, Code
  • 2024.02 ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training, arXiv, Code

Protein&molecule

  • 2022.09 ChemBERTaLM: Exploiting pretrained biochemical language models for targeted drug design, Bioinformatics, Code
  • 2023.03 Deep generative model for drug design from protein target sequence, Journal of Cheminformatics, Code
  • 2023.06 DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins, bioRxiv, Code
  • 2023.10 DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening, arXiv

Comprehensive

  • 2022.11 Galactica: A Large Language Model for Science, arXiv, Code
  • 2023.02 BioTranslator: Multilingual translation for zero-shot biomedical classification using BioTranslator, Nature Communications, Code
  • 2023.05 ChatDrug: ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback, arXiv, Code
  • 2023.08 BioMedGPT: A Pre-trained Language Model for Biomedical Text Mining, arXiv, Code
  • 2023.08 DARWIN Series: Domain Specific Large Language Models for Natural Science, arXiv, Code
  • 2023.10 BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations, arXiv, Code
  • 2023.11 Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models, arXiv, Code
  • 2024.01 BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs, arXiv, Code
  • 2024.02 LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset, arXiv, Page, Model, Dataset
  • 2024.02 Sequence modeling and design from molecular to genome scale with Evo, bioRxiv, Code
  • 2024.02 BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning, arXiv, Code

Datasets and Benchmarks

Molecule&Text

  • ChEBI-20, 2021.11 Text2mol: Cross-modal molecule retrieval with natural language queries, EMNLP2021
  • PCdes, 2022.02 A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals, Nature Communications
  • MoMu, 2022.09 A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language, arXiv
  • PubChemSTM, 2022.12 Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing, arXiv
  • ChEBI-dia, 2023.06 ChatMol: Interactive Molecular Discovery with Natural Language, arXiv
  • PubChemQA, 2023.08 BioMedGPT: A Pre-trained Language Model for Biomedical Text Mining, arXiv
  • MoleculeQA, 2024.03 MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension, arXiv

Protein&Text

  • SwissProtCLAP, 2023.02 ProteinDT: A Text-guided Protein Design Framework, arXiv
  • ProtDescribe, 2023.07 ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts, arXiv
  • Prot2Text, 2023.07 Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers, arXiv
  • UniProtQA, 2023.08 BioMedGPT: A Pre-trained Language Model for Biomedical Text Mining, arXiv
  • InstructProtein, 2023.10 InstructProtein: Aligning Human and Protein Language via Knowledge Instruction, arXiv

Protein&Molecule

Comprehensive

  • Galactica, 2022.11 Galactica: A Large Language Model for Science, arXiv
  • Scientific Knowledge Dataset, 2023.08 DARWIN Series: Domain Specific Large Language Models for Natural Science, arXiv
  • Mol-Instructions, 2023.10 Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models, arXiv
  • SMolInstruct, 2024.02 LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset, arXiv

👥 Contributions

Citation

If you find this repository useful, please cite our paper:

@misc{zhang2024scientific,
      title={Scientific Large Language Models: A Survey on Biological \& Chemical Domains},
      author={Qiang Zhang and Keyan Ding and Tianwen Lyv and Xinda Wang and Qingyu Yin and Yiwen Zhang and Jing Yu and Yuhao Wang and Xiaotong Li and Zhuoyi Xiang and Xiang Zhuang and Zeyuan Wang and Ming Qin and Mengyao Zhang and Jinlu Zhang and Jiyu Cui and Renjun Xu and Hongyang Chen and Xiaohui Fan and Huabin Xing and Huajun Chen},
      year={2024},
      eprint={2401.14656},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributors

Contact
