-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SMILES standardizer code #23
base: main
Are you sure you want to change the base?
Conversation
I think there is some html leaking on the PR , such as " <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } ". I am going to presume that is not relevant to the PR itself. |
Yes you can ignore the leak |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It requires some minor adjustments, but looks okay to me. Let me know if you think some of my comments are too cumbersome to implement. Also, I was able to install all dependencies using pip and lock them (See 0e53775, it uses poetry though). Would you consider adding that as an alternative dependency management solution? Space is already very limited on DGX and conda is well known for its massive venv sizes.
libs/smiles/environment.yml
Outdated
- pandas=1.5.3 | ||
- numpy=1.24.2 | ||
- tqdm=4.64.1 | ||
- rdkit=2022.9.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrt rdkit, doesn't it have wheels for pip installations (https://pypi.org/project/rdkit/)? Admittedly, it doesn't support Windows so conda is probably the right call here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, bizarre, no idea why I didn't find it. Sounds good to me!
standardized_df.to_csv(self.output, index=False) | ||
return self.output | ||
else: | ||
return standardized_df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'run' method of StandardizeMolecule can return different data type (str if output is not None, pd.DataFrame otherwise). May I suggest to print/log the new location and always return the dataframe? Inconsistent output formats are an antipattern (See section 2.3 of this resource )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure – happy to switch to that
"InChI_standardized", | ||
"InChIKey_standardized", | ||
] | ||
for column in new_columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this for loop would be equivalent to self.input_original.drop(new_columns, inplace=True, errors="Ignore")
.
libs/smiles/environment.yml
Outdated
- numpy=1.24.2 | ||
- tqdm=4.64.1 | ||
- rdkit=2022.9.4 | ||
- pymysql=1.0.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not being explicitly used. Is it a dependency for another package? IIRC python now contains sqlite3 so that may be an alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right -- that's a holdover from https://github.com/jump-cellpainting/compound-annotator; will remove
# Convert the InChI to InChIKey | ||
inchikey_standardized = MolToInchiKey(mol_standardized) | ||
|
||
except (ValueError, AttributeError) as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not raising errors seems risky, if we have a fringe case that produces NaNs it will still pass the current tests. I'd suggest to leave the try-except blocks to the user, but this is a design call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point; didn't notice
logging.error(f"Standardization error, {smiles}, Error Type: {str(e)}") | ||
|
||
# return as a dataframe | ||
return pd.DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This output is missing an explicit test (everything is tested together). Please add an explicit test to this, because it is the core functionality. Probably ensuring that each entry respects the SMILE/InChi(Key) format would suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test input is not necessary, for testing you only need the smiles themselves (maybe wrapped in a dataframe). Additionally (this applies to the other tables too) If the other columns are superfluous to the analyisis they should be removed. This ensures that we are not reading noise (which is the case for all columns except for "smiles" here, and for all columns except for the original and standardized column in the other two validation tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requires columns cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requires columns cleanup
) | ||
|
||
# Run the standardization process | ||
standardizer.run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After homogeneising the interface (so this always returns a dataframe). Then, checking the individual columns is important to find which process went wrong.
result = standardizer.run()
for col in ("SMILES", "InChi", "InChiKey"):
assert result[f"{col}_standardized"] == validation[f"{col}_standardized"]
Finally, add equivalent tests to _standardize_structure.
The goal of all this is to have fine-grained understanding of each processing branch. The current testing method only checks for dataframe equality, and that contains a lot of noise and additional formatting.
conda
becauserdkit
is not on pypi.gitignore
because I wasn't sure how you want to handle some ignores globally @afermgjump_canonical
corresponds to the SMILES made public via Add SMILES to compounds.csv.gz jump-cellpainting/datasets#103 (comment) (except for 1 compound; see below)