Leveraging gene expression and genomic variation for cancer prediction using one-shot learning

The Cancer Genome Atlas (TCGA), a cancer genomics reference program, has molecularly characterized more than 20,000 primary cancer samples and paired normal samples covering 33 types of cancer. This joint effort between the National Cancer Institute (NCI) and the National Human Genome Research Institute began in 2006. In the twelve years since, TCGA has generated more than 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. These data have helped establish the importance of cancer genomics and have improved the ability to diagnose, treat, and prevent cancer.

Contribution of this work

During the experiments, the dataset was significantly enlarged to improve the diversity and representativeness of the data, which let the model learn from a wider variety of examples and generalize better. Both models in this study, the classification model and the Siamese model, were also adjusted. A key part of the optimization was the use of custom weights, which assign different weights to instances of the dataset according to the number of samples present. Finally, the types of mutations were specified explicitly, which allows a more precise analysis of the genetic information. This work sits within a research landscape in which numerous studies aim to identify a distinctive genomic signature for different types of cancer.
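
The README does not state which weighting formula was used. As a minimal sketch, assuming the weights are inversely proportional to the number of samples of each cancer type, the idea could look like this (the function name and the labels are hypothetical, chosen only for illustration):

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a {label: weight} map in which rarer labels get larger weights.

    Assumption: the common "balanced" heuristic
    weight = n_samples / (n_classes * count); the project may use another rule.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical cancer-type labels, not taken from the project's dataset.
labels = ["BRCA"] * 6 + ["LUAD"] * 3 + ["GBM"] * 1
weights = inverse_frequency_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

With TensorFlow, a dictionary of this kind (keyed by integer class index rather than by name) is the sort of value that Keras `Model.fit` accepts through its `class_weight` argument. Whether main.py passes its weights that way is not confirmed by this README.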

Requirements (tested)

Module	Version
tensorflow	2.15.0
torch	2.1.2
cuda	12.2

To install TensorFlow, follow this guide: link
To install and set up CUDA and cuDNN, follow this guide:

Related work

For more information about our research project, see the paper: our paper
To view the other papers that contributed to this cancer research study, and on which we have commented, follow this link: other papers

Technical information - main.py

This section provides technical information and installation guides.

Download Dataset

Download all the files in the Dataset folder from Google Drive: LINK;
Place the downloaded files in a folder named dataset;
Copy the dataset folder into the project so that it sits at: /Detection-signature-cancer/code/dataset
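
To confirm that the folder ended up in the right place, a small check such as the following can help. It is only a sketch: the function name is hypothetical, and it assumes the dataset files are CSV files somewhere under code/dataset, as the example paths further down suggest.

```python
from pathlib import Path


def list_dataset_csvs(project_root):
    """Return the CSV files found under <project_root>/code/dataset.

    Paths are returned relative to the dataset folder, sorted by name.
    Raises FileNotFoundError if the dataset folder is missing.
    """
    dataset_dir = Path(project_root) / "code" / "dataset"
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"Expected a dataset folder at {dataset_dir}")
    return sorted(
        path.relative_to(dataset_dir).as_posix()
        for path in dataset_dir.rglob("*.csv")
    )
```

Calling `list_dataset_csvs("Detection-signature-cancer")` from the directory that contains the cloned project should list the downloaded files. An empty list or a `FileNotFoundError` means the folder is misplaced or empty.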

Config Path

This script uses several paths, described below:

Dataset

dataset_path: the dataset to use (SNP_DEL_INS_CNA_mutations_and_variants has two);
encoded_path: the encoded version of the dataset;

Classification

model_path: where the classification model is saved to or loaded from;
risultati_classification: results of the classification;

Siamese

siamese_path: where the Siamese network model is saved to or loaded from;
risultati_siamese: results of the siamese network;

To switch between the 0030 and 0005 datasets (see the paper for their meaning), edit every string that contains 0030 or 0005 and replace that value with the other one.

For example:

dataset_path = ("dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
                    "data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_0030_dataPatient_mutations_and_variants.csv")

Becomes

dataset_path = ("dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
                    "data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_0005_dataPatient_mutations_and_variants.csv")

Or

model_path = "models/0030/classification/espressione_genomica_con_varianti_2LAYER/"

Becomes

model_path = "models/0005/classification/espressione_genomica_con_varianti_2LAYER/"
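
Since the same tag appears in several strings, one way to avoid editing each of them by hand is to derive the paths from a single variable. The sketch below is not part of main.py: `build_paths` and `dataset_tag` are hypothetical names, and only the two paths shown in the examples above are covered.

```python
VALID_TAGS = {"0030", "0005"}


def build_paths(dataset_tag="0030"):
    """Build dataset_path and model_path for the given dataset tag."""
    if dataset_tag not in VALID_TAGS:
        raise ValueError(
            f"dataset_tag must be one of {sorted(VALID_TAGS)}, got {dataset_tag!r}"
        )
    dataset_path = (
        "dataset/data_mrna/SNP_DEL_INS_CNA_mutations_and_variants/"
        "data_mrna_v2_seq_rsem_trasposto_normalizzato_deviazione_"
        f"{dataset_tag}_dataPatient_mutations_and_variants.csv"
    )
    model_path = (
        f"models/{dataset_tag}/classification/"
        "espressione_genomica_con_varianti_2LAYER/"
    )
    return {"dataset_path": dataset_path, "model_path": model_path}


paths = build_paths("0005")
print(paths["dataset_path"])
print(paths["model_path"])
```

Rejecting unknown tags early keeps a typo such as 0003 from silently pointing at a file that does not exist.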

Boolean Variables

Also in the main.py script, you can set the following variables:

only_variant = False: set this to True if you use the dataset that contains only the variations in gene mutations;
data_encoded = False: controls whether the encoded dataset is generated or loaded (leave the default value the first time you run the code)
- False: the encoding is generated;
- True: an existing encoding is loaded;
classification = True: run the classification;
siamese_net = True: run the Siamese network;
siamese_variants = True: set this to True if you use the dataset that contains the variations in gene mutations;
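
The data_encoded flag follows a generate-once, reuse-later pattern. The sketch below illustrates that pattern only. The function name, the JSON storage format, and the `build_encoding` callback are assumptions, and the README does not describe how main.py actually stores its encoding.

```python
import json
from pathlib import Path


def get_encoding(encoded_path, data_encoded, build_encoding):
    """Load a saved encoding, or build and save a new one.

    data_encoded=True  -> load the encoding stored at encoded_path.
    data_encoded=False -> call build_encoding(), save the result, return it.
    """
    path = Path(encoded_path)
    if data_encoded:
        # Reuse the encoding written by an earlier run.
        return json.loads(path.read_text())
    encoding = build_encoding()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(encoding))
    return encoding
```

On a first run, nothing has been saved yet, so loading would fail. This is why the default value False is the right choice until an encoding has been generated.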

The Siamese network can only be launched if a classification model has already been trained and saved. The project already includes a trained classification model. To use the provided models instead of retraining from scratch, set the parameters as follows (example for the 0030 dataset):

only_variant = False
data_encoded = True
classification = False
siamese_net = True
siamese_variants = True

To run the project run the main.py script.
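
Because the Siamese network depends on a saved classification model, it can be useful to check the flag combination before a long run. The following is a hedged sketch: `check_flags` is a hypothetical helper that does not exist in the project, and it assumes the saved model lives in a non-empty directory at model_path.

```python
from pathlib import Path


def check_flags(classification, siamese_net, model_path):
    """Return a list of problems with the chosen flags (empty if none)."""
    problems = []
    if siamese_net and not classification:
        # The classifier is not retrained in this run, so it must already
        # be saved on disk for the Siamese network to use it.
        model_dir = Path(model_path)
        if not model_dir.is_dir() or not any(model_dir.iterdir()):
            problems.append(
                f"siamese_net=True needs a saved classification model, "
                f"but {model_dir} is missing or empty"
            )
    return problems


problems = check_flags(
    classification=False,
    siamese_net=True,
    model_path="models/0030/classification/espressione_genomica_con_varianti_2LAYER/",
)
for problem in problems:
    print("Configuration problem:", problem)
```

The example call is meant to be run from the directory that contains the models folder. If the provided model is present, nothing is printed.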

Author & Contacts

Name	Description
Alberto Montefusco	Developer - Alberto-00; Email - a.montefusco28@studenti.unisa.it; LinkedIn - Alberto Montefusco; Website - alberto-00.github.io
Alessandro Macaro	Developer - mtolkien; Email - a.macaro@studenti.unisa.it; LinkedIn - Alessandro Macaro
