Variant Calling and Ranking

Description

The advent of next-generation sequencing technology has enabled large-scale interrogation of the genome to identify variants in patient samples. The accurate identification of functional variants can provide critical insights into the disease process to guide diagnosis and treatment. However, the use of clinical genomics remains limited because (i) the accurate identification of variants remains suboptimal, and (ii) the large number of variants identified may be difficult to interpret without a systematic approach to ranking them by functional importance. This software platform analyses variant call data with a deep learning neural network to improve the accuracy of variant calling, and then uses a Bayesian classification method to rank functionally relevant genes.

An explanation of what this software was used to do can be found in the paper here.

Components

Analysis, Machine Learning and Ranking

The analysis software contains three components: the feature extraction and engineering component, which generates features from VCF data; the deep learning component, which initialises the neural network and trains it on the extracted features; and the Bayesian graphing component, which performs Bayesian updating to rank the mutations by importance using annotations from ANNOVAR. By integrating information from all five callers into a neural network, we allow the network to use the features of each variant call to predict the probability that the mutation is true. Our research shows that using such a neural network makes a significant difference when determining whether mutations are true (see paper). Subsequently, ranking the mutations allows us to provide clinicians with a set of the most important mutations to focus on.

Preprocessing and Analysis

The preprocessing and analytical components are implemented using Python 2.7 (Van Rossum, 2007) and the following Python libraries: NumPy, scikit-learn, Pomegranate and PyVCF. Briefly, NumPy (v1.11.3) is used to prepare feature vectors for deep learning training, and scikit-learn (v0.18.1) is used to perform Principal Component Analysis (PCA) and the Synthetic Minority Oversampling Technique (SMOTE) (see Appendix 5.3.3 for details). PyVCF (v0.6.8) is used to parse the VCF files so that variants can be compared efficiently, in O(1) average time per lookup, using hash-based dictionary lookups.
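The hash-based comparison described above can be sketched as follows. This is a minimal illustration and not the repository's code: it parses VCF-style data lines with plain string splitting instead of PyVCF so that it stays self-contained, and the function names (`variant_key`, `index_variants`, `label_calls`) and the sample records are hypothetical.

```python
def variant_key(vcf_line):
    """Build a hashable key (chrom, pos, ref, alt) from one VCF data line."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def index_variants(vcf_lines):
    """Collect variant keys in a set, skipping header lines that start with '#'."""
    return {variant_key(line) for line in vcf_lines if not line.startswith("#")}


def label_calls(call_lines, truth_lines):
    """Label each call 1 if it is in the truth set, else 0.

    Membership tests on a set are O(1) on average, so labelling is linear
    in the number of calls.
    """
    truth = index_variants(truth_lines)
    labels = {}
    for line in call_lines:
        if line.startswith("#"):
            continue
        key = variant_key(line)
        labels[key] = 1 if key in truth else 0
    return labels


# Hypothetical records: a ground truth set and the output of one caller.
truth_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr2\t250\t.\tC\tT",
]
caller_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t180\t.\tT\tC",
]
labels = label_calls(caller_vcf, truth_vcf)
```

Keying on the full (chromosome, position, reference, alternate) tuple means two callers only match when they report the same allele change, not merely the same position.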

analysis/machinelearning contains the code that extracts features from the VCF files, mainly the methods found in extractfeaturesfromvcf.py.

Deep Learning Networks

Deep learning networks are implemented using the Keras library (v1.1.1) with a TensorFlow backend (v0.11.0). TensorFlow, from Google (Abadi et al., 2015), is used for better network training performance due to its distributed computation and queue management system. These networks learn from features extracted from the VCF file (see paper section 2.7 for more details on feature extraction), and are used to estimate the probability that a mutation is true.
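To illustrate how a network turns per-variant features into a probability, the sketch below runs a forward pass through one hidden layer with a sigmoid output, in plain Python. It is not the Keras model used in this repository: the layer sizes, the weights and the three input features are made-up values chosen only for illustration.

```python
import math


def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_probability(features, params):
    """Forward pass: features -> hidden layer (ReLU) -> single sigmoid output."""
    hidden = dense(features, params["w_hidden"], params["b_hidden"], relu)
    (prob,) = dense(hidden, params["w_out"], params["b_out"], sigmoid)
    return prob


# Hypothetical, untrained parameters: 3 input features -> 2 hidden units -> 1 output.
params = {
    "w_hidden": [[0.8, 0.5, -0.2], [-0.4, 0.9, 0.3]],
    "b_hidden": [0.0, 0.1],
    "w_out": [[1.2, 0.7]],
    "b_out": [-1.0],
}

# Hypothetical features, e.g. scaled quality, scaled depth, fraction of callers agreeing.
well_supported = predict_probability([0.9, 0.8, 1.0], params)
poorly_supported = predict_probability([0.1, 0.1, 0.2], params)
```

In a trained network the weights would be learned from labelled variants; the sigmoid output is what allows the result to be read as a probability between 0 and 1.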

analysis/machinelearning contains the scripts that control initialisation and training of the neural network, particularly the generatematrixesforneuralnet.py and generateresultsforneuralnet scripts.

Bayesian Network Ranking of Mutations

For the Bayesian ranking of mutations, the high confidence calls from the deep learning network are annotated using ANNOVAR (v2015Jun17) (Wang, Li, & Hakonarson, 2010). The annotated features for each variant are used as inputs to the Bayesian network, which was implemented using Pomegranate (v0.6.1), a Python library for Bayesian analysis.
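The idea of Bayesian updating can be illustrated with a much simpler model than the Pomegranate network used here. The sketch below applies Bayes' rule once per observed annotation, treating annotations as conditionally independent (a naive Bayes assumption that the repository's network does not necessarily make). The annotation names, the prior and the likelihood values are all hypothetical.

```python
def bayes_update(prior, p_evidence_if_important, p_evidence_if_not):
    """Posterior P(important | evidence) from Bayes' rule."""
    numerator = p_evidence_if_important * prior
    denominator = numerator + p_evidence_if_not * (1.0 - prior)
    return numerator / denominator


# Hypothetical likelihoods: (P(annotation | important), P(annotation | not important)).
LIKELIHOODS = {
    "exonic": (0.9, 0.3),
    "nonsynonymous": (0.8, 0.2),
    "rare_in_population": (0.7, 0.4),
}


def importance_score(annotations, prior=0.05):
    """Update the prior with each observed annotation in turn.

    Only annotations that are present are used as evidence; absent
    annotations are ignored in this simplified sketch.
    """
    posterior = prior
    for name in annotations:
        p_important, p_not = LIKELIHOODS[name]
        posterior = bayes_update(posterior, p_important, p_not)
    return posterior


def rank_variants(annotated_variants):
    """Return variant ids sorted from most to least important."""
    scores = {vid: importance_score(ann) for vid, ann in annotated_variants.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical annotated high-confidence calls.
variants = {
    "var_a": ["exonic", "nonsynonymous", "rare_in_population"],
    "var_b": ["exonic"],
    "var_c": [],
}
ranking = rank_variants(variants)
```

Each annotation whose likelihood is higher under the "important" hypothesis pushes the posterior up, so variants with more supporting annotations rise in the ranking.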

analysis/prediction contains the scripts that build the Bayesian ranking network and the accompanying graphs and networks.

Simulator

Simulators allow the generation of simulated datasets for analysis, which enables us to create known ground truth mutations by perturbing reference genomic datasets. This overcomes the difficulties of establishing ground truth mutations in real datasets, and serves as a preliminary source of data for neural network optimisation. Mason (v2.3.1), a genome simulation tool written in C++, is used to simulate sequence reads with known error rates and ground truth variants. Default error rates (indel and substitution rates) use published data from Schirmer et al. (2016).
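The principle of perturbing a reference to obtain a known ground truth can be shown with a toy example. The sketch below is not Mason and does not simulate reads or indels: it only introduces random substitutions into a short sequence and records each one. The function name, the substitution rate and the reference string are hypothetical.

```python
import random


def simulate_substitutions(reference, rate, seed=0):
    """Introduce random substitutions into `reference`.

    Returns the mutated sequence and the ground truth as a list of
    (1-based position, reference base, alternate base) tuples.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    mutated = []
    truth = []
    for i, base in enumerate(reference):
        if base in "ACGT" and rng.random() < rate:
            alt = rng.choice([b for b in "ACGT" if b != base])
            mutated.append(alt)
            truth.append((i + 1, base, alt))
        else:
            mutated.append(base)
    return "".join(mutated), truth


reference = "ACGTACGTACGTACGTACGT"
sample, truth = simulate_substitutions(reference, rate=0.2, seed=42)
```

Because every change is recorded at the moment it is made, the truth list is exact by construction, which is the property that makes simulated data useful for labelling variant calls.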

simulators/scripts contains the base scripts that control the running of the simulator software.

Pipelining Using Nextflow

The workflows in the training and analysis pipelines are managed using Nextflow (v0.21.3.3990), a Groovy-based Domain Specific Language (DSL) that provides easy management of parallel pipelines consisting of dependent tasks organised as a directed acyclic graph (Tommaso et al., 2014). Nextflow is used to manage and coordinate the different steps in the pipelines to ensure reproducibility and scalability.
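To make the directed acyclic graph idea concrete, the sketch below orders a set of dependent tasks with Kahn's algorithm in Python. Nextflow itself derives execution order from the data flowing between processes, so this is only an illustration of the ordering constraint and not of how Nextflow is implemented. The task names are hypothetical and only loosely follow the steps described in this README.

```python
from collections import deque


def topological_order(dependencies):
    """Order tasks so that each one runs after everything it depends on.

    `dependencies` maps each task to the tasks it depends on; every task
    that appears as a dependency must also appear as a key.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    dependents = {task: [] for task in dependencies}
    for task, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(task)

    # Tasks with no unmet dependencies can start immediately.
    ready = deque(sorted(t for t, deps in remaining.items() if not deps))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in dependents[task]:
            remaining[child].discard(task)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(dependencies):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical pipeline steps.
pipeline = {
    "simulate_reads": [],
    "align": ["simulate_reads"],
    "call_freebayes": ["align"],
    "call_samtools": ["align"],
    "extract_features": ["call_freebayes", "call_samtools"],
    "train_network": ["extract_features"],
}
order = topological_order(pipeline)
```

The two calling tasks depend only on alignment and not on each other, which is what allows a workflow manager to run them in parallel.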

simulators/pipelines contains all the pipelining software written in Nextflow to automate the simulator and variant calling processes.

Variant Calling and Alignment

Variant callers are bioinformatics tools used to call mutations, which are specific genomic differences between a sample genome and a reference genome. However, individual variant callers have low concordance with one another and high false positive rates. Here, we aggregate the data from five different variant callers to update and inform our deep learning network. The variant callers used are: FreeBayes (v1.0.2-16); GATK Haplotype Caller (v3.7-0) and Unified Genotyper (v3.7-0); Samtools (v1.3.1); and Pindel (v2.3.0) (Garrison & Marth, 2012; McKenna et al., 2010; DePristo et al., 2011; Li H, et al., 2009; Ye et al., 2009).
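Concordance between callers, and the per-variant support that aggregation provides, can be sketched as below. This is an illustration under assumptions and not the repository's method: concordance is measured here as the Jaccard index of two call sets, and the call sets themselves are hypothetical.

```python
from itertools import combinations


def support_counts(calls_by_caller):
    """Map each variant to the sorted list of callers that reported it."""
    support = {}
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support.setdefault(variant, []).append(caller)
    return {variant: sorted(callers) for variant, callers in support.items()}


def jaccard(a, b):
    """Concordance between two call sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_concordance(calls_by_caller):
    """Jaccard concordance for every pair of callers, keyed by sorted name pair."""
    return {
        (x, y): jaccard(calls_by_caller[x], calls_by_caller[y])
        for x, y in combinations(sorted(calls_by_caller), 2)
    }


# Hypothetical call sets keyed by (chrom, pos, ref, alt).
calls = {
    "freebayes": {("chr1", 100, "A", "G"), ("chr1", 180, "T", "C")},
    "samtools": {("chr1", 100, "A", "G")},
    "pindel": {("chr2", 300, "CT", "C")},
}
support = support_counts(calls)
concordance = pairwise_concordance(calls)
```

Which callers support a variant is the kind of information that can be passed to the network as a feature, rather than being used as a hard majority vote.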

simulators/scripts contains the base scripts that control the running of the variant calling software, as well as the options used to run the variant callers.

Overall Pipeline

Two main computational pipelines are present in this software: (i) a training pipeline for training and optimising the neural network, and (ii) an analysis pipeline that uses a trained neural network to perform variant prediction and validation (see below).

In the training pipeline, training datasets from synthetic and real sequencing data were used for performing the processing steps of alignment, variant calling and training of the deep learning network. In the analysis pipeline, the trained and optimised network from the training pipeline is then used to predict high-confidence variant calls in naive samples without ground truth variant calls. Finally, Bayesian network analysis is used to rank the functionally important variants/mutations from the high confidence calls identified from naive samples in the analysis pipeline.

Main documentation about this software can be found in the Introduction/Materials and Methods of here.
