dalgu90/abbr-exp-ml4h

abbr-exp-ml4h

Implementation of "Improved Clinical Abbreviation Expansion via Non-Sense-Based Approaches" (paper), presented at the ML4H (Machine Learning for Health) workshop at NeurIPS 2020.

This repository contains the non-sense-based (without gloss) and sense-based (with gloss) approaches to clinical abbreviation expansion, both built on BERT. (The code for the permutation language model variant is coming soon in another repository.) The code is based on BlueBERT (previously named NCBI-BERT), a biomedical version of BERT.

Prerequisites

  1. TensorFlow 1.12+
  2. Pre-trained model of BlueBERT
  3. A clinical abbreviation expansion dataset (MSH, UMN, or ShARe/CLEF 2013 Task 2)
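
Before installing anything else, it can help to confirm that the TensorFlow version in the active environment meets the 1.12 minimum. The following is a minimal sketch, not part of this repository: `version_ge` is a hypothetical helper, and it assumes a `sort` that supports the `-V` (version sort) option.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  # The smaller of the two versions sorts first; $1 >= $2 exactly when
  # that smallest entry is $2.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes `python` resolves to the environment used for training):
# tf_version="$(python -c 'import tensorflow as tf; print(tf.__version__)')"
# version_ge "$tf_version" 1.12 || echo "TensorFlow $tf_version is older than 1.12" >&2
```

The usage lines are left commented out so that the helper can be sourced on a machine where TensorFlow is not yet installed.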

How to Run

# Install required python packages on your environment
$ pip install -r requirement.txt

# Download the BlueBERT parameters
$ wget https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12.zip
$ unzip NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12.zip -d bert_models

# Download and prepare dataset
$ ./scripts/download_umn.sh   # UMN
$ ./scripts/download_msh.sh   # MSH (downloading dataset required)
# Run the notebook scripts/preprocess_sc13t2.ipynb for the ShARe/CLEF dataset (manual downloading and installation required)

# Fine-tune and evaluate the model.
$ ./scripts/umn_masklm2.sh     # Masked LM on UMN, one of 10-fold CV
$ ./scripts/msh_masklm2_new.sh # Masked LM on MSH, one of 10-fold CV
$ ./scripts/sc13t2_masklm2.sh  # Masked LM on ShARe/CLEF. Please run scripts/evaluate_sc13t2_lrabr.ipynb to compute the accuracy on test unseen examples.
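
Fine-tuning will fail if the pre-trained parameters were not unpacked where the scripts expect them, so a quick check before launching a run may save time. The sketch below is not part of this repository: `check_bert_dir` is a hypothetical helper, and the three file names are assumptions taken from the usual TensorFlow BERT checkpoint layout. Adjust them, and the directory argument, to match what the archive actually contains.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Reports which assumed checkpoint files are missing from a directory.
# Returns 0 when all are present, 1 otherwise.
check_bert_dir() {
  local dir="$1"
  local missing=0
  local f
  # Assumed file names (usual TensorFlow BERT checkpoint layout);
  # edit this list if the unpacked archive differs.
  for f in bert_config.json vocab.txt bert_model.ckpt.index; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the directory is an assumption; point it at the unpacked model):
# check_bert_dir bert_models && ./scripts/umn_masklm2.sh
```

If the archive unpacks into a subdirectory of bert_models, pass that subdirectory instead.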

Acknowledgement

We thank the authors of BERT and BlueBERT for the implementation and the weights pre-trained on biomedical corpora.

Cite this work

@InProceedings{juyong2020improved,
  author    = {Juyong Kim and Linyuan Gong and Justin Khim and Jeremy C. Weiss and Pradeep Ravikumar},
  title     = {Improved Clinical Abbreviation Expansion via Non-Sense-Based Approaches},
  booktitle = {Proceedings of the Machine Learning for Health NeurIPS Workshop (ML4H 2020)},
  year      = {2020}
}
