As part of the TREC-COVID challenge, the Information Retrieval Research Group at Technische Hochschule Köln develops search and retrieval algorithms to support the search for relevant information on COVID-19.

trec-covid

Submission details round #2

  1. irc_bm25_altmetric:
    This run combines a BM25 baseline with altmetrics. The baseline is retrieved with the default ranker of Elasticsearch/Lucene (BM25), using queries built from the contents of the <query>, <question>, and <narrative> tags. We rerank the baseline by adding the log-transformed Altmetric Attention Score.
  2. irc_logreg_tfidf:
    This run combines a BM25 baseline with a logistic-regression reranker trained on TF-IDF features in combination with the relevance judgments of the first round. The baseline is retrieved with the default ranker of Elasticsearch/Lucene (BM25), using queries built from the contents of the <query>, <question>, and <narrative> tags. Documents are reranked for those topics for which relevance judgments are available (topics 1-30); for the remaining topics (31-35), the baseline ranking remains unaltered.
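The Altmetric reranking in irc_bm25_altmetric can be sketched as follows. This is a minimal illustration, not the actual run script: the function name, data layout, and the use of `log1p` for the log transform are assumptions.

```python
import math

def rerank_with_altmetrics(baseline, attention_scores):
    """Rerank a baseline run by adding the log-transformed
    Altmetric Attention Score to each document's retrieval score.

    baseline: list of (doc_id, bm25_score) tuples
    attention_scores: dict mapping doc_id -> Altmetric Attention Score
    """
    reranked = []
    for doc_id, bm25_score in baseline:
        # log1p leaves documents without altmetrics (score 0) unchanged
        altmetric = math.log1p(attention_scores.get(doc_id, 0.0))
        reranked.append((doc_id, bm25_score + altmetric))
    # Sort by the combined score, best first
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked
```

A document with a high attention score can thus overtake a slightly better BM25 match, while documents without altmetric data keep their baseline score.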

Submission details round #1

Description:

As part of TREC-COVID, we submit automatic runs based on (pseudo) relevance feedback in combination with a reranking approach. The reranker is trained on relevance feedback data that is retrieved from PubMed/PubMed Central (PMC). The training data is retrieved with queries using the contents of the <query> tags only.

For each topic, a new reranker is trained. We treat the documents retrieved for a specific topic's query as relevant training data and the documents retrieved for the other 29 topics as non-relevant training data. Given a baseline run, the trained model reranks its documents.

The baseline run is retrieved with the default ranker of Elasticsearch/Lucene (BM25) and queries using the contents of the <query> tags only. For our reranker we use GloVe embeddings in combination with the Deep Relevance Matching Model (DRMM).
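The per-topic labeling described above can be sketched as follows; this is only an illustration of the labeling scheme, and the actual feature extraction and DRMM training in train.py are more involved:

```python
def build_training_labels(feedback_docs, topic_id):
    """Label PubMed/PMC feedback documents for one topic's reranker.

    feedback_docs: dict mapping topic_id -> list of document texts
                   retrieved for that topic's <query>
    Returns (text, label) pairs: documents retrieved for this topic
    are labeled relevant (1), documents of the other topics
    non-relevant (0).
    """
    examples = []
    for tid, docs in feedback_docs.items():
        label = 1 if tid == topic_id else 0
        examples.extend((doc, label) for doc in docs)
    return examples
```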

Our three run submissions differ in how the training data is retrieved from PubMed/PMC.

  1. irc_entrez:
    The first run is trained on titles and abstracts retrieved from the Entrez Programming Utilities API with "type=relevance".
  2. irc_pubmed:
    The second run is trained on titles and abstracts retrieved from PubMed's search interface with "best match". We scrape the PMIDs and retrieve the titles and abstracts afterwards.
  3. irc_pmc:
    The third run is trained on full text documents retrieved from PMC.
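Fetching PMIDs from the Entrez Programming Utilities, as in the first run, amounts to an ESearch request. A minimal sketch of building such a request URL with the standard library follows; the exact parameters used by fetchPubmedData.py (and the config values ESEARCH and RETMODE it reads) are assumptions:

```python
from urllib.parse import urlencode

# ESearch endpoint of the Entrez Programming Utilities
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmode="json", retmax=100):
    """Build an ESearch URL that returns PMIDs for a topic query."""
    params = {
        "db": "pubmed",
        "term": term,          # contents of the topic's <query> tag
        "retmode": retmode,    # cf. RETMODE in config.py
        "retmax": retmax,      # number of PMIDs to return
    }
    return ESEARCH + "?" + urlencode(params)
```

The PMIDs returned by ESearch are then passed to EFetch to retrieve titles and abstracts.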

Workflow

(workflow diagram)

Setup

Our retrieval pipeline relies on the following dependencies:

  • Docker
  • Elasticsearch
  • requests
  • BeautifulSoup
  • MatchZoo

  • Install docker. When running on SciComp (Ubuntu VM):
sudo usermod -aG docker $USER
  • Create a virtual environment and activate it:
python3 -m venv venv
source venv/bin/activate
  • Install requirements:
pip3 install -r requirements.txt

  • Install the required NLTK data:
python3 -m nltk.downloader punkt
  • Download the datasets:
./scripts/getDataSets.sh
  • Fetch data for the 30 topics from PubMed (written to the artifact directory with a timestamp)
python3 scripts/fetchPubmedData.py
  • Convert embeddings from bin to txt
python3 scripts/convert_word2vec.py
  • Optional: Adapt settings in config.py

Baseline run

  • Download image and run Elasticsearch container
python3 scripts/docker-run.py
  • Index data
python3 scripts/index.py
  • Write baseline run file
python3 scripts/base.py
  • Optional: Delete the docker container and remove the image
python3 scripts/docker-rm.py
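base.py queries the index and writes the baseline in the standard TREC run-file format (topic, Q0, doc id, rank, score, run tag). A minimal sketch of formatting such run lines; the function name and run tag are illustrative, not taken from the script:

```python
def format_run_lines(topic_id, ranking, run_tag="irc_baseline"):
    """Format one topic's ranking as TREC run-file lines.

    ranking: list of (doc_id, score) tuples, best first.
    Each line has the form: topic Q0 doc_id rank score run_tag
    """
    lines = []
    for rank, (doc_id, score) in enumerate(ranking, start=1):
        lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines
```

Run files in this format can be scored directly with trec_eval against the round's relevance judgments.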

Reranking

  • Train model for each of the 30 topics and save models to ./artifact/model/<model-type>
python3 scripts/train.py
  • Rerank baseline ranking:
python3 scripts/rerank.py
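rerank.py combines the baseline score with the reranker's score, weighted by RERANK_WEIGHT (default 0.5, per config.py). A sketch of such a combination as a linear interpolation; the exact formula used in the script is an assumption:

```python
def interpolate(baseline_score, rerank_score, weight=0.5):
    """Linearly interpolate baseline and reranker scores.

    weight corresponds to RERANK_WEIGHT in config.py: 0.0 keeps the
    baseline ranking, 1.0 ranks by the reranker score alone.
    """
    return (1.0 - weight) * baseline_score + weight * rerank_score
```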

config.py

| param | comment |
| --- | --- |
| DOCS | dictionary with index names as keys and paths to data as values |
| BULK | if set to True, data is indexed in bulk |
| SINGLE_IDX | if not None, all data is indexed into one instance |
| TOPIC | path to the topic file |
| BASELINE | name of the baseline run |
| DATA | path to the directory with subsets |
| META | path to metadata.csv |
| VALID_ID | path to the XML file with valid doc ids |
| ESEARCH | PubMed E-utilities API endpoint to retrieve PMIDs given a query term |
| EFETCH | PubMed E-utilities API endpoint to retrieve document data given one or more PMIDs |
| RETMODE | datatype of PubMed E-utilities results |
| PUBMED_FETCH | directory for data fetched from PubMed |
| PUBMED_DUMP_DATE | date of the PubMed data used for training |
| MODEL_DUMP | path to the directory where model weights are stored |
| MODEL_TYPE | model type; at the moment, dense and drmm are supported |
| RUN_DIR | path to the output runs |
| RERANKED_RUN | name of the reranked run |
| PUBMED_SCRAPE | bool; if set to True, PMIDs are scraped from the PubMed frontend |
| PUBMED_FRONT | URL of the PubMed frontend |
| RESULT_SIZE | number of results to be retrieved from PUBMED_FRONT |
| RERANK_WEIGHT | weight param for the reranker score; default: 0.5 |
| IMAGE_TAG | |
| CONTAINER_NAME | |
| FULLTEXT_PMC | |
| RUN_TAG | |
| ESEARCH_PMC | |
| EFETCH_PMC | |
| EMBEDDING | |
| EMBED_DIR | |
| BIOWORDVEC | |
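A minimal excerpt of what such a config.py might look like; all values below are illustrative placeholders, not the repository's actual settings:

```python
# config.py (illustrative excerpt)
DOCS = {"cord19": "./data/comm/"}  # index name -> path to data
BULK = True                        # index data in bulk
BASELINE = "irc_baseline"          # name of the baseline run
MODEL_TYPE = "drmm"                # "dense" or "drmm"
RERANK_WEIGHT = 0.5                # weight of the reranker score
```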

Datasets Round 1

| name | link |
| --- | --- |
| comm | commercial use subset |
| noncomm | non-commercial use subset |
| custom | custom license subset |
| biorxiv | bioRxiv/medRxiv subset |
