Add SMILES standardizer code #23

shntnu · 2024-04-06T12:34:41Z

Moves over https://github.com/jump-cellpainting/compound-annotator/blob/main/StandardizeMolecule.py to here, after factoring in @srijitseal's code from https://github.com/jump-cellpainting/jump-cellpainting/pull/156
I had to use conda because rdkit is not on pypi
Added a .gitignore because I wasn't sure how you want to handle some ignores globally @afermg
Tests pass locally
Did not check coverage
The jump_canonical output corresponds to the SMILES made public via "Add SMILES to compounds.csv.gz" (jump-cellpainting/datasets#103 (comment)), except for 1 compound; see below.

import pandas as pd

df = pd.read_csv("/home/ec2-user/jump-cellpainting/3.standardize/standardize_ksiling_jumpmoa_jumptarget2/data/05_release/2022_10_18_JUMP-CP_compound_library_restandardized.csv", low_memory=False)
df = df.drop(columns=["jcp2020_id"])

df2 = df.loc[df.InChI_standardized != df.InChI_standardized_orig].drop_duplicates()

df2 = df2.transpose()
df2.columns = ['X1', 'X2']
df2


	X1	X2
SMILES_original	C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC...	C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC...
SMILES_standardized	CCC1NC(=O)C(C(C)O)NC(=O)c2csc(n2)C23CCC(c4nc(C...	CCC1NC(=O)C(C(C)O)NC(=O)c2csc(n2)C23CCC(c4nc(C...
InChI_standardized	InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7...	InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7...
InChIKey_standardized	AXHZBYJITSPJMH-UHFFFAOYSA-N	AXHZBYJITSPJMH-UHFFFAOYSA-N
jcp2022_id	JCP2022_091373	JCP2022_091373
pert_iname	thiostrepton	thiostrepton
InChIKey_orig	NSFFHOGKXHRQEW-DVRIZHICSA-N	NSFFHOGKXHRQEW-AIHSUZKVSA-N
InChIKey_standardized_orig	UTBOEBCWXGDOGI-UHFFFAOYSA-N	UTBOEBCWXGDOGI-UHFFFAOYSA-N
InChI_standardized_orig	InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7...	InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7...
jump_cp_control_type	NaN	NaN
pert_control_iname	NaN	NaN
Source[1]	Broad	Broad
Source[2]	NaN	NaN
Source[3]	NaN	NaN
Source[4]	NaN	NaN
Selection[1]	T	T
Selection[2]	NaN	NaN
Selection[3]	NaN	NaN

df2.loc[df2.X1 != df2.X2]


	X1	X2
InChIKey_orig	NSFFHOGKXHRQEW-DVRIZHICSA-N	NSFFHOGKXHRQEW-AIHSUZKVSA-N
jump_cp_control_type	NaN	NaN
pert_control_iname	NaN	NaN
Source[2]	NaN	NaN
Source[3]	NaN	NaN
Source[4]	NaN	NaN
Selection[2]	NaN	NaN
Selection[3]	NaN	NaN
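
Note that the NaN rows show up in this filter only because NaN != NaN evaluates to True in pandas; the one real difference is InChIKey_orig. A minimal sketch of a comparison that treats two missing values as equal, using a small toy frame with the same X1/X2 layout (only the two InChIKey values are copied from the table above; the rest is illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transposed frame above; only a few rows are reproduced.
toy = pd.DataFrame(
    {
        "X1": ["NSFFHOGKXHRQEW-DVRIZHICSA-N", np.nan, "thiostrepton"],
        "X2": ["NSFFHOGKXHRQEW-AIHSUZKVSA-N", np.nan, "thiostrepton"],
    },
    index=["InChIKey_orig", "jump_cp_control_type", "pert_iname"],
)

# A row is a real difference only if the values are unequal
# and are not both missing.
both_missing = toy.X1.isna() & toy.X2.isna()
real_diff = toy.loc[(toy.X1 != toy.X2) & ~both_missing]
print(real_diff.index.tolist())
```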

afermg · 2024-04-08T11:58:42Z

I think there is some HTML leaking into the PR, such as " <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } ". I am going to presume that is not relevant to the PR itself.

shntnu · 2024-04-08T12:16:59Z

Yes you can ignore the leak

afermg

It requires some minor adjustments, but looks okay to me. Let me know if you think some of my comments are too cumbersome to implement. Also, I was able to install all dependencies using pip and lock them (see 0e53775; it uses poetry, though). Would you consider adding that as an alternative dependency management solution? Space is already very limited on DGX, and conda is well known for its large environment sizes.

afermg · 2024-04-08T12:17:36Z

libs/smiles/environment.yml

+  - pandas=1.5.3
+  - numpy=1.24.2
+  - tqdm=4.64.1
+  - rdkit=2022.9.4


wrt rdkit, doesn't it have wheels for pip installations (https://pypi.org/project/rdkit/)? Admittedly, it doesn't support Windows so conda is probably the right call here.

Oh, bizarre, no idea why I didn't find it. Sounds good to me!

afermg · 2024-04-08T12:21:36Z

libs/smiles/src/smiles/standardize_smiles.py

+            standardized_df.to_csv(self.output, index=False)
+            return self.output
+        else:
+            return standardized_df


The 'run' method of StandardizeMolecule can return different data types (str if output is not None, pd.DataFrame otherwise). May I suggest printing/logging the new location and always returning the dataframe? Inconsistent output formats are an antipattern (see section 2.3 of this resource).

Sure – happy to switch to that
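
A minimal sketch of the suggested behaviour, assuming a standalone helper rather than the actual StandardizeMolecule.run method (the function name run_and_save and the one-row frame are hypothetical): the file is written when a path is given, the location is logged, and the dataframe is returned either way.

```python
import logging
import tempfile
from pathlib import Path
from typing import Optional

import pandas as pd

logger = logging.getLogger(__name__)


def run_and_save(standardized_df: pd.DataFrame, output: Optional[str] = None) -> pd.DataFrame:
    """Hypothetical helper: optionally write the CSV, always return the dataframe."""
    if output is not None:
        standardized_df.to_csv(output, index=False)
        logger.info("Standardized table written to %s", output)
    return standardized_df


# Usage with a throwaway file; the single-row frame is illustrative only.
df = pd.DataFrame({"SMILES_standardized": ["CCO"]})
with tempfile.TemporaryDirectory() as tmp:
    out_path = str(Path(tmp) / "standardized.csv")
    saved = run_and_save(df, out_path)
    written = pd.read_csv(out_path)
in_memory = run_and_save(df)
```

With a single return type, callers and tests never need to branch on whether an output path was set.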

afermg · 2024-04-08T12:25:08Z

libs/smiles/src/smiles/standardize_smiles.py

+                "InChI_standardized",
+                "InChIKey_standardized",
+            ]
+            for column in new_columns:


IIUC this for loop would be equivalent to self.input_original.drop(columns=new_columns, inplace=True, errors="ignore").
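
A small sketch of that drop call on a toy frame, assuming the loop only removes the listed columns when they are present. Note that pandas expects the lowercase string "ignore" and needs columns= (or axis=1) to drop column labels; the frame contents here are placeholders.

```python
import pandas as pd

# Illustrative frame: only one of the two derived columns is already present.
df = pd.DataFrame(
    {
        "SMILES_original": ["CCO"],
        "InChI_standardized": ["placeholder"],
    }
)
new_columns = ["InChI_standardized", "InChIKey_standardized"]

# errors="ignore" skips labels that are absent instead of raising KeyError.
df.drop(columns=new_columns, inplace=True, errors="ignore")
print(list(df.columns))
```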

afermg · 2024-04-08T12:48:06Z

libs/smiles/environment.yml

+  - numpy=1.24.2
+  - tqdm=4.64.1
+  - rdkit=2022.9.4
+  - pymysql=1.0.2


This is not being explicitly used. Is it a dependency of another package? IIRC Python includes sqlite3 in the standard library, so that may be an alternative.

You are right -- that's a holdover from https://github.com/jump-cellpainting/compound-annotator; will remove

afermg · 2024-04-08T12:55:43Z

libs/smiles/src/smiles/standardize_smiles.py

+            # Convert the InChI to InChIKey
+            inchikey_standardized = MolToInchiKey(mol_standardized)
+
+        except (ValueError, AttributeError) as e:


Not raising errors seems risky: if a fringe case produces NaNs, it will still pass the current tests. I'd suggest leaving the try-except blocks to the user, but this is a design call.

Good point; didn't notice
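
One way to read this suggestion, sketched with a hypothetical stand-in (toy_standardize is not the real RDKit-based routine): the library function lets exceptions propagate, and a caller who wants lenient behaviour wraps the call explicitly.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def toy_standardize(smiles: str) -> str:
    """Hypothetical stand-in for the real standardizer; rejects empty input."""
    if not smiles:
        raise ValueError("empty SMILES")
    return smiles.strip()


def standardize_all(smiles_list: List[str], func: Callable[[str], str]) -> List[str]:
    """Let exceptions propagate; the caller decides how to handle failures."""
    return [func(s) for s in smiles_list]


# A caller who wants to log and continue opts in explicitly.
try:
    standardize_all(["CCO", ""], toy_standardize)
    failed = False
except ValueError as err:
    logger.error("Standardization error: %s", err)
    failed = True
```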

afermg · 2024-04-08T12:57:53Z

libs/smiles/src/smiles/standardize_smiles.py

+            logging.error(f"Standardization error, {smiles}, Error Type: {str(e)}")
+
+        # return as a dataframe
+        return pd.DataFrame(


This output is missing an explicit test (everything is tested together). Please add an explicit test for it, because it is the core functionality. Ensuring that each entry respects the SMILES/InChI(Key) format would probably suffice.
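
A rough sketch of such format checks, covering only InChIKey and InChI (checking SMILES validity would need a parser such as RDKit, which is left out here). The helper names are hypothetical; the patterns assume standard InChI, where keys have 14, 10, and 1 uppercase letters separated by hyphens and strings start with "InChI=1S/".

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
# Standard InChI strings start with this prefix.
INCHI_PREFIX = "InChI=1S/"


def looks_like_inchikey(value: str) -> bool:
    """True if the whole string matches the InChIKey layout."""
    return INCHIKEY_RE.fullmatch(value) is not None


def looks_like_inchi(value: str) -> bool:
    """True if the string carries the standard InChI prefix."""
    return value.startswith(INCHI_PREFIX)


# The key is copied from the PR description; the InChI is a truncated prefix,
# which is enough for a prefix check.
assert looks_like_inchikey("AXHZBYJITSPJMH-UHFFFAOYSA-N")
assert looks_like_inchi("InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7")
assert not looks_like_inchikey("not-a-key")
```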

afermg · 2024-04-08T13:01:19Z

libs/smiles/test/test_data/smiles_data/JUMP-Target-2_compound_metadata_trimmed_input.tsv

This test input is not necessary; for testing you only need the SMILES themselves (maybe wrapped in a dataframe). Additionally (this applies to the other tables too), if the other columns are superfluous to the analysis they should be removed. This ensures that we are not reading noise, which is the case for all columns except "smiles" here, and for all columns except the original and standardized columns in the other two validation tables.
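
A short sketch of reading only the needed column, assuming the input keeps a column named "smiles" as mentioned above. The in-memory TSV and its other column names and values are made up for illustration.

```python
import io

import pandas as pd

# Made-up TSV standing in for the trimmed metadata file; column names other
# than "smiles" are hypothetical.
raw = io.StringIO("broad_sample\tsmiles\tpert_iname\nBRD-X\tCCO\tethanol\n")

# Read only the column the test needs and ignore everything else.
smiles_only = pd.read_csv(raw, sep="\t", usecols=["smiles"])
print(smiles_only)
```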

afermg · 2024-04-08T13:02:03Z

...st/test_data/smiles_data/JUMP-Target-2_compound_metadata_trimmed_output_jump_alternate_1.csv

Requires column cleanup

afermg · 2024-04-08T13:02:19Z

...test/test_data/smiles_data/JUMP-Target-2_compound_metadata_trimmed_output_jump_canonical.csv

Requires column cleanup

afermg · 2024-04-08T13:07:09Z

libs/smiles/test/test_standardize_smiles.py

+    )
+
+    # Run the standardization process
+    standardizer.run()


Once the interface is homogenised (so that run always returns a dataframe), checking the individual columns is important for finding which process went wrong:

result = standardizer.run()
for col in ("SMILES", "InChI", "InChIKey"):
    assert result[f"{col}_standardized"].equals(validation[f"{col}_standardized"])

Finally, add equivalent tests to _standardize_structure.

The goal of all this is to have fine-grained understanding of each processing branch. The current testing method only checks for dataframe equality, and that contains a lot of noise and additional formatting.
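
A self-contained sketch of that per-column idea, assuming the three standardized column names used in this PR. The helper compare_columns is hypothetical and the frame values are placeholders, not real standardizer output; the point is that a failure names the column that diverged.

```python
import pandas as pd
from pandas.testing import assert_series_equal

COLUMNS = ("SMILES", "InChI", "InChIKey")


def compare_columns(result: pd.DataFrame, validation: pd.DataFrame) -> list:
    """Return the names of standardized columns that differ from validation."""
    mismatched = []
    for col in COLUMNS:
        name = f"{col}_standardized"
        try:
            assert_series_equal(result[name], validation[name])
        except AssertionError:
            mismatched.append(name)
    return mismatched


# Illustrative frames; the values are placeholders.
validation = pd.DataFrame(
    {
        "SMILES_standardized": ["CCO"],
        "InChI_standardized": ["inchi-a"],
        "InChIKey_standardized": ["key-a"],
    }
)
result = validation.copy()
result["InChIKey_standardized"] = ["key-b"]
print(compare_columns(result, validation))
```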

shntnu added 2 commits April 6, 2024 08:26

add initial code

555b72f

Update tests

bd6e3b9

shntnu mentioned this pull request Apr 6, 2024

Srijit Seal's standardizer jump-cellpainting/compound-annotator#8

Closed

Credits

912315d

shntnu requested a review from afermg April 6, 2024 12:39

shntnu changed the title ~~Add standardizer code~~ Add SMILES standardizer code Apr 6, 2024

afermg mentioned this pull request Apr 8, 2024

Relationship to jump-cellpainting/compound-annotator #22

Closed

afermg requested changes Apr 8, 2024


Rename

c0bafda

afermg mentioned this pull request Apr 8, 2024

Homogenise chemical sources #24

Open

afermg self-assigned this May 1, 2024
