Automate the construction of a Knowledge Graph containing interactions for any given gene.
The exponential accumulation of biological data makes it difficult to integrate new knowledge in a way that leads to actionable insights. The bioNX project demonstrates one way to do this: it automates the creation of a Knowledge Graph of protein-protein interaction (PPI) networks in Neo4j.
Using a graph database makes it possible to explore the context and relationships in the data using various algorithms:
- Community detection
- Centrality to measure the importance of a node
- Prediction of properties based on similarity
- Prediction of undiscovered relationships
- Finding the shortest paths between nodes
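To make two of these ideas concrete, here is a minimal sketch of degree centrality and shortest-path search in plain Python. It is only an illustration: the edges and gene names (GENE_A and so on) are made-up placeholders, not real interactions, and in practice such algorithms would more likely run inside Neo4j than in hand-written code like this.

```python
from collections import deque

# Made-up undirected interaction edges; the gene names are placeholders.
EDGES = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_D"),
    ("GENE_A", "GENE_E"),
    ("GENE_E", "GENE_D"),
    ("GENE_B", "GENE_E"),
]


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbours."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neigh) / (n - 1) for node, neigh in adjacency.items()}


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}  # node -> node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(adjacency[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


ADJACENCY = build_adjacency(EDGES)
CENTRALITY = degree_centrality(ADJACENCY)
PATH = shortest_path(ADJACENCY, "GENE_A", "GENE_D")
print(CENTRALITY["GENE_B"], PATH)
```

Breadth-first search is enough here because the toy edges are unweighted; weighted interactions would need a different algorithm such as Dijkstra's.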
bioNX is a work in progress.
Data sources:
- BioGRID - Primary data source for PPIs
- HGNC - Gene nomenclature reference
- PubMed - Literature
- UniProt - Protein properties (pending implementation)
- Entrez - Gene properties (pending implementation)
- GO - Gene properties (pending implementation)
Clone repo:
git clone https://github.com/abk7777/bioNX
Install Python libraries:
cd bioNX
pip install -r requirements.txt
Update the .env file with the correct values:
BIOGRID_ACCESS_KEY=<BIOGRID_ACCESS_KEY>
NEO4J_USERNAME=<NEO4J_USERNAME>
NEO4J_PASSWORD=<NEO4J_PASSWORD>
NEO4J_BOLT_URL=bolt://localhost:7687
NEO4J_HOME=<NEO4J_HOME>
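Before running the notebook it can help to confirm that no placeholder was left unfilled. The following is a minimal sketch of such a check, not part of bioNX: the helper names parse_env and missing_keys are hypothetical, and how the notebook itself reads the .env file is not covered here.

```python
REQUIRED_KEYS = (
    "BIOGRID_ACCESS_KEY",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "NEO4J_BOLT_URL",
    "NEO4J_HOME",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still a <PLACEHOLDER>."""
    missing = []
    for key in required:
        value = values.get(key, "")
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(key)
    return missing


# Sample text for illustration only; a real check would read the .env file.
SAMPLE = """\
BIOGRID_ACCESS_KEY=<BIOGRID_ACCESS_KEY>
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=
NEO4J_BOLT_URL=bolt://localhost:7687
"""

print(missing_keys(parse_env(SAMPLE)))
```

With the sample text, the check reports the unfilled access key, the empty password, and the absent NEO4J_HOME entry.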
Make the data directory:
cd bioNX
mkdir -p data/clean/
Start Jupyter Notebook:
cd notebooks/ && \
jupyter notebook
Open the notebook 0.1-biogrid-data.ipynb and run its contents. This will output a file named biogrid_ppi_data.csv to the import directory in the $NEO4J_HOME folder and place a copy of it in the data/clean/ directory for easy access.
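The write-then-copy step can be sketched as follows. This is an assumption-laden illustration rather than the notebook's actual code: the function name export_csv, the column names gene_a and gene_b, and the placeholder row are all hypothetical, and the demo writes into a temporary directory instead of a real $NEO4J_HOME.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def export_csv(rows, fieldnames, neo4j_home, clean_dir,
               filename="biogrid_ppi_data.csv"):
    """Write rows to <neo4j_home>/import and copy the file into clean_dir."""
    import_path = Path(neo4j_home) / "import" / filename
    import_path.parent.mkdir(parents=True, exist_ok=True)
    with import_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    clean_path = Path(clean_dir) / filename
    clean_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(import_path, clean_path)
    return import_path, clean_path


# Demo in a throwaway directory; column names and values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_rows = [{"gene_a": "GENE_A", "gene_b": "GENE_B"}]
    imported, cleaned = export_csv(
        demo_rows,
        ["gene_a", "gene_b"],
        neo4j_home=Path(tmp) / "neo4j",
        clean_dir=Path(tmp) / "data" / "clean",
    )
    print(imported.name, cleaned.parent.name)
```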
To specify a gene, update the gene parameter under the section Select Gene. Note that API requests are throttled to 10 per second, so fetching a large result set can take a long time. Use the limit parameter to cap the number of results.
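For readers writing their own fetch loop, a client-side throttle that respects the 10-per-second figure could look like the sketch below. The Throttle class and fetch_all function are hypothetical names, not part of bioNX, and the fetch argument stands in for whatever function actually calls the API. Calls are simply spaced at least 1/rate seconds apart.

```python
import time


class Throttle:
    """Space calls so that consecutive calls start at least 1/rate s apart."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest start time for the next call

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


def fetch_all(items, fetch, rate=10, limit=None):
    """Call fetch on up to limit items, spaced according to rate."""
    throttle = Throttle(rate)
    results = []
    for item in items[:limit]:  # limit=None keeps every item
        throttle.wait()
        results.append(fetch(item))
    return results


# Placeholder fetch function; a real one would issue an API request.
print(fetch_all(["a", "b", "c"], str.upper, rate=10, limit=2))
```

At this spacing, N requests take at least roughly N/10 seconds, which is why capping the number of results matters for large queries.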
The simplest way to load the graph into Neo4j is to copy and paste the contents of the neo4j/load.cyp script into Neo4j and run it.
Example Cypher query returning the genes, interactions, and authors for the MTHFR gene as mentioned in PubMed article "26186194":
MATCH (gene1:Gene { name: 'MTHFR' })-[:INTERACTS_WITH]-(gene2:Gene),
(gene1)-[:MENTIONED_IN]->(article:Article { pubmed_id:"26186194" })<-[:MENTIONED_IN]-(gene2),
(article)<-[:PUBLISHED]-(author:Author),
(gene1)-[:INTERACTOR_IN]->(interaction:Interaction)<-[:INTERACTOR_IN]-(gene2)
RETURN gene1, gene2, author, article, interaction;
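The same query can also be run from Python. The sketch below assumes the official neo4j Python driver is installed (it may or may not be listed in requirements.txt) and that the NEO4J_* values from the .env file have been exported as environment variables. The function names build_params and run_query are hypothetical, and the gene name and PubMed ID are passed as query parameters instead of being written into the query text.

```python
import os

QUERY = """
MATCH (gene1:Gene {name: $gene})-[:INTERACTS_WITH]-(gene2:Gene),
      (gene1)-[:MENTIONED_IN]->(article:Article {pubmed_id: $pubmed_id})<-[:MENTIONED_IN]-(gene2),
      (article)<-[:PUBLISHED]-(author:Author),
      (gene1)-[:INTERACTOR_IN]->(interaction:Interaction)<-[:INTERACTOR_IN]-(gene2)
RETURN gene1, gene2, author, article, interaction
"""


def build_params(gene, pubmed_id):
    """Query parameters; pubmed_id stays a string, as in the query above."""
    return {"gene": gene, "pubmed_id": str(pubmed_id)}


def run_query(gene, pubmed_id):
    """Run QUERY against the database named in the environment variables."""
    # Imported lazily so the rest of this sketch works without the driver.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(
        os.environ["NEO4J_BOLT_URL"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    )
    try:
        with driver.session() as session:
            result = session.run(QUERY, build_params(gene, pubmed_id))
            return [record.data() for record in result]
    finally:
        driver.close()


print(build_params("MTHFR", "26186194"))
```

Calling run_query("MTHFR", "26186194") requires a running Neo4j instance with the graph already loaded.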
Running load.cyp in Neo4j will produce a graph with the following schema:
(Gene)-[:INTERACTOR_IN]->(Interaction)
(Gene)-[:INTERACTS_WITH]-(Gene)
(Interaction)-[:MENTIONED_IN]->(Article)
(Gene)-[:MENTIONED_IN]->(Article)
(Author)-[:PUBLISHED]->(Article)
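One way to keep this schema at hand when writing queries or import code is to hold it as plain data. The sketch below is illustrative only and not part of bioNX: it encodes the five patterns above as (start label, relationship type, end label) triples, and the helper names is_valid and node_labels are hypothetical.

```python
# The five patterns listed above, as (start label, type, end label) triples.
SCHEMA = {
    ("Gene", "INTERACTOR_IN", "Interaction"),
    ("Gene", "INTERACTS_WITH", "Gene"),
    ("Interaction", "MENTIONED_IN", "Article"),
    ("Gene", "MENTIONED_IN", "Article"),
    ("Author", "PUBLISHED", "Article"),
}

# INTERACTS_WITH is written without a direction in the schema above.
UNDIRECTED = {"INTERACTS_WITH"}


def is_valid(start_label, rel_type, end_label):
    """True if the pattern matches the schema, honouring direction."""
    if (start_label, rel_type, end_label) in SCHEMA:
        return True
    return rel_type in UNDIRECTED and (end_label, rel_type, start_label) in SCHEMA


def node_labels():
    """All node labels that appear at either end of a relationship."""
    labels = set()
    for start_label, _, end_label in SCHEMA:
        labels.update((start_label, end_label))
    return sorted(labels)


print(node_labels())
print(is_valid("Article", "PUBLISHED", "Author"))
```

The second check is rejected because PUBLISHED points from Author to Article, not the other way round.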
See the open issues for a list of proposed features (and known issues). The current planned implementation includes:
- Expand graph schema with nodes for:
  - Protein complexes
  - Cofactors
  - RNAs
  - KEGG pathways
  - Post-translational modifications
  - Chromosome loci
  - Subcellular location
  - Tissue
  - Organ
  - Disease condition
Please feel free to include suggestions for things like:
- Nodes, relationships and properties
- Data sources
- Functionality and features
- Bug fixes
While the project is still getting off the ground, please feel free to start a discussion in the open issues.
Gregory Lindsey - @abk7x4 - gclindsey@gmail.com
Project Link: https://github.com/abk7777/bioNX