Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SMILES standardizer code #23

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Add SMILES standardizer code #23

wants to merge 4 commits into from

Conversation

shntnu
Copy link

@shntnu shntnu commented Apr 6, 2024


import pandas as pd

df = pd.read_csv("/home/ec2-user/jump-cellpainting/3.standardize/standardize_ksiling_jumpmoa_jumptarget2/data/05_release/2022_10_18_JUMP-CP_compound_library_restandardized.csv", low_memory=False)
df = df.drop(columns=["jcp2020_id"])

df2 = df.loc[df.InChI_standardized != df.InChI_standardized_orig].drop_duplicates()
df2 = df2.transpose()
df2.columns = ['X1', 'X2']
df2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
X1 X2
SMILES_original C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC... C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC...
SMILES_standardized CCC1NC(=O)C(C(C)O)NC(=O)c2csc(n2)C23CCC(c4nc(C... CCC1NC(=O)C(C(C)O)NC(=O)c2csc(n2)C23CCC(c4nc(C...
InChI_standardized InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7... InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7...
InChIKey_standardized AXHZBYJITSPJMH-UHFFFAOYSA-N AXHZBYJITSPJMH-UHFFFAOYSA-N
jcp2022_id JCP2022_091373 JCP2022_091373
pert_iname thiostrepton thiostrepton
InChIKey_orig NSFFHOGKXHRQEW-DVRIZHICSA-N NSFFHOGKXHRQEW-AIHSUZKVSA-N
InChIKey_standardized_orig UTBOEBCWXGDOGI-UHFFFAOYSA-N UTBOEBCWXGDOGI-UHFFFAOYSA-N
InChI_standardized_orig InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7... InChI=1S/C72H85N19O18S5/c1-14-26(3)47-63(105)7...
jump_cp_control_type NaN NaN
pert_control_iname NaN NaN
Source[1] Broad Broad
Source[2] NaN NaN
Source[3] NaN NaN
Source[4] NaN NaN
Selection[1] T T
Selection[2] NaN NaN
Selection[3] NaN NaN
df2.loc[df2.X1 != df2.X2]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
X1 X2
InChIKey_orig NSFFHOGKXHRQEW-DVRIZHICSA-N NSFFHOGKXHRQEW-AIHSUZKVSA-N
jump_cp_control_type NaN NaN
pert_control_iname NaN NaN
Source[2] NaN NaN
Source[3] NaN NaN
Source[4] NaN NaN
Selection[2] NaN NaN
Selection[3] NaN NaN

@shntnu shntnu requested a review from afermg April 6, 2024 12:39
@shntnu shntnu changed the title Add standardizer code Add SMILES standardizer code Apr 6, 2024
@afermg
Copy link
Collaborator

afermg commented Apr 8, 2024

I think there is some html leaking on the PR , such as " <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } ". I am going to presume that is not relevant to the PR itself.

@shntnu
Copy link
Author

shntnu commented Apr 8, 2024

Yes you can ignore the leak

Copy link
Collaborator

@afermg afermg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It requires some minor adjustments, but looks okay to me. Let me know if you think some of my comments are too cumbersome to implement. Also, I was able to install all dependencies using pip and lock them (See 0e53775, it uses poetry though). Would you consider adding that as an alternative dependency management solution? Space is already very limited on DGX and conda is well known for its massive venv sizes.

- pandas=1.5.3
- numpy=1.24.2
- tqdm=4.64.1
- rdkit=2022.9.4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrt rdkit, doesn't it have wheels for pip installations (https://pypi.org/project/rdkit/)? Admittedly, it doesn't support Windows so conda is probably the right call here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, bizarre, no idea why I didn't find it. Sounds good to me!

standardized_df.to_csv(self.output, index=False)
return self.output
else:
return standardized_df
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'run' method of StandardizeMolecule can return different data type (str if output is not None, pd.DataFrame otherwise). May I suggest to print/log the new location and always return the dataframe? Inconsistent output formats are an antipattern (See section 2.3 of this resource )

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure – happy to switch to that

"InChI_standardized",
"InChIKey_standardized",
]
for column in new_columns:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this for loop would be equivalent to self.input_original.drop(new_columns, inplace=True, errors="Ignore").

- numpy=1.24.2
- tqdm=4.64.1
- rdkit=2022.9.4
- pymysql=1.0.2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not being explicitly used. Is it a dependency for another package? IIRC python now contains sqlite3 so that may be an alternative.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right -- that's a holdover from https://github.com/jump-cellpainting/compound-annotator; will remove

# Convert the InChI to InChIKey
inchikey_standardized = MolToInchiKey(mol_standardized)

except (ValueError, AttributeError) as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not raising errors seems risky, if we have a fringe case that produces NaNs it will still pass the current tests. I'd suggest to leave the try-except blocks to the user, but this is a design call.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; didn't notice

logging.error(f"Standardization error, {smiles}, Error Type: {str(e)}")

# return as a dataframe
return pd.DataFrame(
Copy link
Collaborator

@afermg afermg Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This output is missing an explicit test (everything is tested together). Please add an explicit test to this, because it is the core functionality. Probably ensuring that each entry respects the SMILE/InChi(Key) format would suffice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test input is not necessary, for testing you only need the smiles themselves (maybe wrapped in a dataframe). Additionally (this applies to the other tables too) If the other columns are superfluous to the analyisis they should be removed. This ensures that we are not reading noise (which is the case for all columns except for "smiles" here, and for all columns except for the original and standardized column in the other two validation tables.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requires columns cleanup

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requires columns cleanup

)

# Run the standardization process
standardizer.run()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After homogeneising the interface (so this always returns a dataframe). Then, checking the individual columns is important to find which process went wrong.

result = standardizer.run()
for col in ("SMILES", "InChi", "InChiKey"):
    assert result[f"{col}_standardized"] == validation[f"{col}_standardized"]

Finally, add equivalent tests to _standardize_structure.

The goal of all this is to have fine-grained understanding of each processing branch. The current testing method only checks for dataframe equality, and that contains a lot of noise and additional formatting.

@afermg afermg self-assigned this May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants