Genetic Mutations Multi-Class Classification

Project Overview

Cancer tumors can have thousands of genetic mutations, making it challenging to distinguish driver mutations that contribute to tumor growth from passenger mutations. Currently, clinical pathologists manually interpret genetic mutations based on text-based clinical literature, which is a time-consuming process. Machine learning offers a solution to expedite this task with high accuracy.

In this project, we create features from medical literature data and develop a machine learning algorithm to automatically classify genetic variations. We experiment with different models, including Logistic Regression, Random Forest, K-Nearest Neighbors (KNN), and Naive Bayes, to find the most suitable model.

Aim

The primary objective of this project is to classify genetic mutations into nine classes based on medical literature.

Data Description

The dataset is divided into variants and text for training and test datasets. It includes the following features:

ID: The row identifier linking mutations to clinical evidence.
Gene: The gene where the genetic mutation is located.
Variation: The amino acid change for these mutations.
Class: A numerical classification of the genetic mutation (1-9).
Text: Clinical evidence used to classify the genetic mutation.

Note: files >100 mb are excluded

Tech Stack

Language: Python
Libraries: pymongo[srv] (used for database connectivity)

Approach

Data Reading
Data Analysis: Utilizing libraries such as pandas, numpy, pretty_confusion_matrix, matplotlib, and sklearn to analyze data related to Class, Gene, and Variation.
Text Preprocessing
Splitting Data, Evaluation, and Features Extraction
Model Building: Utilizing models like Logistic Regression, Random Forest, KNN, and Naive Bayes.
Hyperparameter Tuning: Focusing on hyperparameter tuning for Logistic Regression.
Model Evaluation: Including confusion matrix and log loss evaluation.

Modular Code Overview

lib: A reference folder containing the original iPython notebook.
ml_pipeline: A folder with functions organized into different Python files, called by the engine.py script to run the project steps and train the model.
requirements.txt: Lists all required libraries and their versions. Install them with pip install -r requirements.txt.
config.yaml: Contains project specifications.
readme.md: Detailed instructions for running the code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

figures

figures

lib

lib

ml_pipeline

ml_pipeline

LICENSE

LICENSE

config.yaml

config.yaml

engine.py

engine.py

readme.md

readme.md

requirements.txt

requirements.txt

Repository files navigation

Genetic Mutations Multi-Class Classification

Project Overview

Aim

Data Description

Tech Stack

Approach

Modular Code Overview

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
data		data
figures		figures
lib		lib
ml_pipeline		ml_pipeline
LICENSE		LICENSE
config.yaml		config.yaml
engine.py		engine.py
readme.md		readme.md
requirements.txt		requirements.txt

License

AjNavneet/Genetic-Mutation-Classification

Folders and files

Latest commit

History

Repository files navigation

Genetic Mutations Multi-Class Classification

Project Overview

Aim

Data Description

Tech Stack

Approach

Modular Code Overview

About

Topics

Resources

License

Stars

Watchers

Forks

Languages