As part of the TREC-COVID challenge, the Information Retrieval Research Group at Technische Hochschule Köln develops search and retrieval algorithms to support the search for relevant information on COVID-19.

trec-covid

Submission details round #2

  1. irc_bm25_altmetric:
    This run combines a BM25 baseline with altmetrics. The baseline is retrieved with the default ranker of Elasticsearch/Lucene (BM25), using queries built from the contents of the <query>, <question>, and <narrative> tags. We rerank the baseline by adding the log-transformed Altmetric Attention Score.
  2. irc_logreg_tfidf:
    This run combines a BM25 baseline with a logistic-regression reranker trained on TF-IDF features in combination with the relevance judgments of the first round. The baseline is retrieved with the default ranker of Elasticsearch/Lucene (BM25), using queries built from the contents of the <query>, <question>, and <narrative> tags. Documents are reranked for those topics for which relevance judgments are available (topics 1-30); for the remaining topics (31-35), the baseline ranking remains unaltered.
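The Altmetric reranking in irc_bm25_altmetric can be sketched as follows. This is a minimal illustration, not the actual run script: the function name, data layout, and the use of `log1p` for the log transform are assumptions.

```python
import math

def rerank_with_altmetrics(baseline, attention_scores):
    """Rerank a baseline run by adding the log-transformed
    Altmetric Attention Score to each document's retrieval score.

    baseline: list of (doc_id, bm25_score) tuples
    attention_scores: dict mapping doc_id -> Altmetric Attention Score
    """
    reranked = []
    for doc_id, bm25_score in baseline:
        # log1p leaves documents without altmetrics (score 0) unchanged
        altmetric = math.log1p(attention_scores.get(doc_id, 0.0))
        reranked.append((doc_id, bm25_score + altmetric))
    # Sort by the combined score, best first
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked
```

A document with a high attention score can thus overtake a slightly better BM25 match, while documents without altmetric data keep their baseline score.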

Submission details round #1

Description:

As part of TREC-COVID, we submit automatic runs based on (pseudo) relevance feedback in combination with a reranking approach. The reranker is trained on relevance feedback data that is retrieved from PubMed/PubMed Central (PMC). The training data is retrieved with queries using the contents of the <query> tags only.

For each topic, a new reranker is trained. We treat the documents retrieved for a specific topic's query as relevant training data and the documents retrieved for the other 29 topics as non-relevant training data. Given a baseline run, the trained model reranks its documents.

The baseline run is retrieved with the default ranker of Elasticsearch/Lucene (BM25) and queries using the contents of the <query> tags only. For our reranker we use GloVe embeddings in combination with the Deep Relevance Matching Model (DRMM).
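The per-topic labeling described above can be sketched as follows; this is only an illustration of the labeling scheme, and the actual feature extraction and DRMM training in train.py are more involved:

```python
def build_training_labels(feedback_docs, topic_id):
    """Label PubMed/PMC feedback documents for one topic's reranker.

    feedback_docs: dict mapping topic_id -> list of document texts
                   retrieved for that topic's <query>
    Returns (text, label) pairs: documents retrieved for this topic
    are labeled relevant (1), documents of the other topics
    non-relevant (0).
    """
    examples = []
    for tid, docs in feedback_docs.items():
        label = 1 if tid == topic_id else 0
        examples.extend((doc, label) for doc in docs)
    return examples
```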

Our three run submissions differ in how the training data is retrieved from PubMed/PMC.

  1. irc_entrez:
    The first run is trained on titles and abstracts retrieved from the Entrez Programming Utilities API with "type=relevance".
  2. irc_pubmed:
    The second run is trained on titles and abstracts retrieved from PubMed's search interface with "best match". We scrape the PMIDs and retrieve the titles and abstracts afterwards.
  3. irc_pmc:
    The third run is trained on full text documents retrieved from PMC.
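Fetching PMIDs from the Entrez Programming Utilities, as in the first run, amounts to an ESearch request. A minimal sketch of building such a request URL with the standard library follows; the exact parameters used by fetchPubmedData.py (and the config values ESEARCH and RETMODE it reads) are assumptions:

```python
from urllib.parse import urlencode

# ESearch endpoint of the Entrez Programming Utilities
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmode="json", retmax=100):
    """Build an ESearch URL that returns PMIDs for a topic query."""
    params = {
        "db": "pubmed",
        "term": term,          # contents of the topic's <query> tag
        "retmode": retmode,    # cf. RETMODE in config.py
        "retmax": retmax,      # number of PMIDs to return
    }
    return ESEARCH + "?" + urlencode(params)
```

The PMIDs returned by ESearch are then passed to EFetch to retrieve titles and abstracts.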

Workflow

(workflow diagram)

Setup

Our retrieval pipeline relies on the following dependencies:

  • Docker
  • Elasticsearch
  • requests
  • BeautifulSoup
  • MatchZoo

  • Install docker. When running on SciComp (Ubuntu VM):
sudo usermod -aG docker $USER
  • Create a virtual environment and activate it:
python3 -m venv venv
source venv/bin/activate
  • Install requirements:
pip3 install -r requirements.txt

  • Install the required NLTK data:
python3 -m nltk.downloader punkt
  • Download the datasets:
./scripts/getDataSets.sh
  • Fetch data for the 30 topics from PubMed (written to the artifact directory with a timestamp)
python3 scripts/fetchPubmedData.py
  • Convert embeddings from bin to txt
python3 scripts/convert_word2vec.py
  • Optional: Adapt settings in config.py

Baseline run

  • Download image and run Elasticsearch container
python3 scripts/docker-run.py
  • Index data
python3 scripts/index.py
  • Write baseline run file
python3 scripts/base.py
  • Optional: Delete the docker container and remove the image
python3 scripts/docker-rm.py
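base.py queries the index and writes the baseline in the standard TREC run-file format (topic, Q0, doc id, rank, score, run tag). A minimal sketch of formatting such run lines; the function name and run tag are illustrative, not taken from the script:

```python
def format_run_lines(topic_id, ranking, run_tag="irc_baseline"):
    """Format one topic's ranking as TREC run-file lines.

    ranking: list of (doc_id, score) tuples, best first.
    Each line has the form: topic Q0 doc_id rank score run_tag
    """
    lines = []
    for rank, (doc_id, score) in enumerate(ranking, start=1):
        lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines
```

Run files in this format can be scored directly with trec_eval against the round's relevance judgments.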

Reranking

  • Train model for each of the 30 topics and save models to ./artifact/model/<model-type>
python3 scripts/train.py
  • Rerank baseline ranking:
python3 scripts/rerank.py
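rerank.py combines the baseline score with the reranker's score, weighted by RERANK_WEIGHT (default 0.5, per config.py). A sketch of such a combination as a linear interpolation; the exact formula used in the script is an assumption:

```python
def interpolate(baseline_score, rerank_score, weight=0.5):
    """Linearly interpolate baseline and reranker scores.

    weight corresponds to RERANK_WEIGHT in config.py: 0.0 keeps the
    baseline ranking, 1.0 ranks by the reranker score alone.
    """
    return (1.0 - weight) * baseline_score + weight * rerank_score
```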

config.py

| param | comment |
| --- | --- |
| DOCS | dictionary with index names as keys and paths to data as values |
| BULK | if set to True, data is indexed in bulk |
| SINGLE_IDX | if not None, all data is indexed into one instance |
| TOPIC | path to the topic file |
| BASELINE | name of the baseline run |
| DATA | path to the directory with subsets |
| META | path to metadata.csv |
| VALID_ID | path to the XML file with valid doc ids |
| ESEARCH | PubMed E-utilities API endpoint to retrieve PMIDs given a query term |
| EFETCH | PubMed E-utilities API endpoint to retrieve document data given one or more PMIDs |
| RETMODE | datatype of PubMed E-utilities results |
| PUBMED_FETCH | directory for data fetched from PubMed |
| PUBMED_DUMP_DATE | date of the PubMed data used for training |
| MODEL_DUMP | path to the directory where model weights are stored |
| MODEL_TYPE | model type; at the moment, dense and drmm are supported |
| RUN_DIR | path to the output runs |
| RERANKED_RUN | name of the reranked run |
| PUBMED_SCRAPE | bool; if set to True, PMIDs are scraped from the PubMed frontend |
| PUBMED_FRONT | URL of the PubMed frontend |
| RESULT_SIZE | number of results to be retrieved from PUBMED_FRONT |
| RERANK_WEIGHT | weight param for the reranker score; default: 0.5 |
| IMAGE_TAG | |
| CONTAINER_NAME | |
| FULLTEXT_PMC | |
| RUN_TAG | |
| ESEARCH_PMC | |
| EFETCH_PMC | |
| EMBEDDING | |
| EMBED_DIR | |
| BIOWORDVEC | |
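A minimal excerpt of what such a config.py might look like; all values below are illustrative placeholders, not the repository's actual settings:

```python
# config.py (illustrative excerpt)
DOCS = {"cord19": "./data/comm/"}  # index name -> path to data
BULK = True                        # index data in bulk
BASELINE = "irc_baseline"          # name of the baseline run
MODEL_TYPE = "drmm"                # "dense" or "drmm"
RERANK_WEIGHT = 0.5                # weight of the reranker score
```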

Datasets Round 1

| name | link |
| --- | --- |
| comm | commercial use subset |
| noncomm | non-commercial use subset |
| custom | custom license subset |
| biorxiv | bioRxiv/medRxiv subset |
