Feature interrelation profiling

The feature interrelation profiling (FIP) methodology was used to examine databases of natural products and commercially available compounds. The COCONUT database of NPs was selected for this analysis and the ZINC database was used to compile a reference dataset of commercially available compounds. The first step involved preprocessing of the COCONUT dataset and sampling similar structures among commercially available compounds from the ZINC database. The selected commercially available compounds were subsequently compared with NPs. The ECFPs and ECFP-like fragments were computed for the compounds. Finally, the FIP methodology was applied to the data to quantitatively determine the differences in feature combinations between NPs and generic commercially available compounds.

Step by step

01_11_2021

Coconut database analysis from sdf file

Downloaded COCONUT database from https://coconut.naturalproducts.net/download in SDF format
Made a dataframe with molecular weights and logP values of COCONUT data
Conducted basic analysis of the COCONUT data: plotted the distributions of logP and molecular weight values
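A minimal sketch of how such a descriptor table could be built, assuming RDKit is installed. The SMILES strings are illustrative stand-ins for molecules read from the COCONUT SDF download, the helper name describe_molecules is hypothetical, and the use of Crippen logP is an assumption, since the notebooks do not state which logP model was used.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors


def describe_molecules(smiles_list):
    """Return one dict per parsable SMILES with molecular weight and Crippen logP."""
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip structures RDKit cannot parse
            continue
        rows.append({
            "Smiles": smiles,
            "MW": Descriptors.MolWt(mol),
            "logP": Crippen.MolLogP(mol),  # assumption: Crippen model
        })
    return rows


# Illustrative input only; the real analysis reads the COCONUT SDF file.
rows = describe_molecules(["CCO", "c1ccccc1", "not_a_smiles"])
for row in rows:
    print(row)
```

The resulting list of dicts can be passed directly to pandas.DataFrame for plotting.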

10_11_2021

ZINC sampling

ZINC data were downloaded at https://zinc.docking.org/tranches/home/
zinc_database1.ipynb: extracting substances from the ZINC database that are similar to the COCONUT data
zinc_analysis1.ipynb: analysing the ZINC data

14_04_2022

Removing duplicates, creating CSV files for easy access to the data

ZINC_analysis_07-04.ipynb:

Removing duplicates among the sampled ZINC substances (exactly 10 substances had an identical first InChIKey block).
Creating a new CSV with the ZINC substances, containing Zinc_id, Smiles, MW, logP, InChI (Inchi) and the first InChIKey block (Inchi_s). Naming it ZINCFINAL.csv (379 012 substances).

COCO_to_CSV.ipynb:

Creating a new CSV containing the COCONUT data: 'MW': db_mw, 'logP': db_logP, 'Smiles': db_smiles, 'Inchi': db_inchikey, 'coconut_id': db_COCONUT_id. Naming it COCOALL.csv (405 000 substances).
Then deleting approx. 20 000 duplicates by the first InChIKey block, leaving 386 297 substances. Naming the CSV COCOFINAL.csv.
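Deduplication by the first InChIKey block can be sketched as follows. This is an illustration in plain Python, not the notebook code: the function names are hypothetical, and the 'Inchi' field holds the full InChIKey, as in the CSV description above.

```python
def first_inchikey_block(inchikey):
    """Return the connectivity block that precedes the first hyphen."""
    return inchikey.split("-")[0]


def drop_duplicates_by_first_block(records):
    """Keep the first record seen for each first InChIKey block."""
    seen = set()
    unique = []
    for record in records:
        block = first_inchikey_block(record["Inchi"])
        if block not in seen:
            seen.add(block)
            unique.append(record)
    return unique


# Toy records; B and C share the first block and differ only in the second one.
records = [
    {"id": "A", "Inchi": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"id": "B", "Inchi": "QNAYBMKLOCPYGJ-REOHCLBHSA-N"},
    {"id": "C", "Inchi": "QNAYBMKLOCPYGJ-UWTATZPHSA-N"},
]
unique = drop_duplicates_by_first_block(records)
print([r["id"] for r in unique])
```

Because only the first block is compared, structures that differ solely in stereochemistry or protonation state are treated as duplicates.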

ECFP_11_04.ipynb:

Creating fingerprints for COCOFINAL.csv and ZINCFINAL.csv using radius 2, length 2048
Stored for each molecule: fp = Chem.GetMorganFingerprintAsBitVect(molecule, 2, nBits=2048, bitInfo=bitinfo), the on bits list(fp.GetOnBits()), and bitinfo
Saving ECFP to ZINCECFP.csv
UPDATE (ECFP_15_04.ipynb): added the ECFP of the COCONUT data (COCOEFCP.csv); deleted bitinfo (also from ZINCECFP.csv), keeping just the ECFP and the bit set
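A minimal sketch of the fingerprint step for a single molecule, assuming RDKit is installed and importing the function from rdkit.Chem.AllChem. The helper name ecfp_on_bits and the example SMILES are illustrative and not taken from the notebooks.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def ecfp_on_bits(smiles, radius=2, n_bits=2048):
    """Return (on-bit indices, bit info dict) for a Morgan/ECFP bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    bitinfo = {}  # filled by RDKit: bit index -> ((atom index, radius), ...)
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, bitInfo=bitinfo
    )
    return list(fp.GetOnBits()), bitinfo


on_bits, bitinfo = ecfp_on_bits("CC(=O)Oc1ccccc1C(=O)O")  # illustrative molecule
print(len(on_bits), "bits set")
```

Recent RDKit releases mark this function as deprecated in favour of the rdFingerprintGenerator module, so a newer environment may print a deprecation warning.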

20_04_2022

18_04heavyatoms.ipynb:

Used l.GetNumHeavyAtoms() to get the number of heavy atoms in each molecule
Made box plots and basic statistics using the seaborn library and .describe()
Did both for the ZINC and COCONUT data
COCONUT data: mean 34.161340 heavy atoms
ZINC data: mean 28.009868 heavy atoms

19-05-2022

Generated random subsets of the COCONUT and ZINC data:
First, shuffled the whole dataframe and then split it into 10 equal parts using the following:
suppl_csv = suppl_csv.sample(frac=1).reset_index(drop=True)
df_split = np.array_split(suppl_csv, 10)
Accessing the subsets via df_split[0], df_split[1], etc.
Performed basic analysis of the COCONUT and ZINC data
Compared these profiles

23_06_2022

Computing relative feature tightness (RFT) against a pKLD(COCONUT/ZINC) interrelation profile

This folder contains the Jupyter notebooks (23_06relative_feature_tightness[0-9]) in which I compute relative feature tightness against a pKLD interrelation profile. The dataset had already been split into 10 parts, so 90% of the dataset is used for training and 10% for testing.
The process described below is repeated 10 times, each time with a different 10% of the dataset as the test part. The results are shown in 23_06_2022/images/RFT.pdf.
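The rotation of train and test parts can be sketched in plain Python as below. The function names and the fixed seed are assumptions made for the illustration; the notebooks work on pandas dataframes instead of lists.

```python
import random


def split_into_folds(items, n_folds=10, seed=0):
    """Shuffle a copy of items and deal it into n_folds parts of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def train_test_rounds(folds):
    """Yield (train, test) pairs, using each fold exactly once as the test part."""
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# 100 integers stand in for the rows of the dataset.
folds = split_into_folds(range(100), n_folds=10, seed=42)
rounds = list(train_test_rounds(folds))
for train, test in rounds:
    print(len(train), len(test))
```

Fixing the seed makes the split reproducible between runs, which the unseeded sample(frac=1) call does not guarantee.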

Process of assigning RFT:

Building the feature pointwise KL divergence (pKLD) profile between COCONUT and ZINC
Sampling structures from the COCONUT and ZINC datasets
Computing relative feature tightness against the pKLD interrelation profile
Plotting the results; the graphs are shown in 23_06_2022/images/RFT.pdf
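To illustrate the first step, here is a rough sketch of a pointwise KL divergence profile over co-occurring feature pairs. It uses the generic pointwise term p * log2(p / q) with additive smoothing, which is an assumption: the exact definition and smoothing used by the fip library may differ, and the RFT scoring itself is not reproduced here. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(feature_sets):
    """Count in how many structures each unordered feature pair co-occurs."""
    counts = Counter()
    for features in feature_sets:
        counts.update(combinations(sorted(set(features)), 2))
    return counts


def pointwise_kld_profile(reference_sets, background_sets, pseudocount=1.0):
    """Return {pair: p * log2(p / q)}, p from reference and q from background."""
    ref, bg = pair_counts(reference_sets), pair_counts(background_sets)
    pairs = set(ref) | set(bg)
    # Additive smoothing keeps p and q non-zero for pairs seen in one set only.
    ref_total = sum(ref.values()) + pseudocount * len(pairs)
    bg_total = sum(bg.values()) + pseudocount * len(pairs)
    profile = {}
    for pair in pairs:
        p = (ref[pair] + pseudocount) / ref_total
        q = (bg[pair] + pseudocount) / bg_total
        profile[pair] = p * math.log2(p / q)
    return profile


# Toy on-bit sets standing in for COCONUT (reference) and ZINC (background).
coconut_like = [{1, 2, 3}, {1, 2}, {1, 2, 4}]
zinc_like = [{3, 4}, {1, 4}, {3, 4, 5}]
profile = pointwise_kld_profile(coconut_like, zinc_like)
for pair, value in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

Pairs with a positive value are relatively more frequent in the reference set, pairs with a negative value in the background set.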

23_06themostfrequentfragment.ipynb:

In the Jupyter notebook 23_06themostfrequentfragment.ipynb I analyzed the most frequent combinations of fragments and also looked at the bits representing these fragments.

07_07_2022

Updated the previous Jupyter notebooks in folder 23_06_2022:
Added ROC curves
The resulting ROC curve images are in the folder images
The folder fragments contains the preprocessing of COCONUT and ZINC fragments using the function fip.chem.rdmol2morgan_feature_smiles
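For reference, a ROC curve and its area can be computed from per-structure scores as sketched below. This is a generic plain-Python illustration with toy scores, not the plotting code from the notebooks; the function names are hypothetical.

```python
def roc_curve_points(scores, labels):
    """Return [(fpr, tpr), ...] for thresholds swept from high to low score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy scores; label 1 = natural product (COCONUT), 0 = ZINC compound.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_curve_points(scores, labels)
print(points)
print("AUC:", auc(points))
```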

10_08_2022

Creation of RFT profiles using ECFP-like fragments with radius 2

17_08_2022

Looking at the most frequent fragments in COCONUT dataset compared to ZINC dataset

28_03_2023

Adding 10 RFT profiles into one graph, adding visually nicer MW and log P graphs, looking at the most frequent fragments in COCONUT and ZINC dataset

Workflow of this work:

Conclusion

In this work, the feature interrelation profiling methodology was used to quantitatively analyze the substructure interrelation differences between natural products and generic commercially available compounds. The analysis revealed a notable difference in feature pair interrelations between natural products and commercially available compounds that could be effectively used to identify natural products and natural product-like substances. This approach is fully transparent and generic. Furthermore, it does not rely on any heuristics or specific models, while matching or exceeding their reported performance.
