
Every time I try "conda env create -f environment.yml", it stalls at "solving environment" and eventually fails; I hit the same problem last year, and all dependencies were installed before running environment.yml #7

YUANMENG-1 opened this issue Mar 5, 2024 · 9 comments


@YUANMENG-1

Excuse me, every time I try "conda env create -f environment.yml", it stalls at "solving environment" (shown in the screenshot) and eventually fails to continue. I remember hitting the same problem last year, when I also failed to get BLINK working, and I installed all the dependencies before running the environment.yml.

This happens on both HPC and Mac.

[screenshot: conda stuck at "solving environment"]

Originally posted by @YUANMENG-1 in #4 (comment)

@tharwood3
Collaborator

Hi @YUANMENG-1. I'm able to get the environment to solve on my machine, but it does indeed take a long time. I'll look into improvements I can make to the environment.yml to fix this issue. In the meantime, maybe try using Mamba? It is a drop-in replacement for Conda and is typically much faster at building environments. I gave it a try with BLINK's environment and it finished in just a few minutes:

mamba env create -f environment.yml
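
If Mamba isn't installed yet, one common way to get it (a general note, not something from this thread) is from conda-forge into the base environment:

conda install -n base -c conda-forge mamba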

Let me know how it goes and if you need any additional help getting BLINK going.

@YUANMENG-1
Author

Thanks for your suggestions, I was able to get BLINK installed through Mamba.

[screenshot: successful environment creation with Mamba]

@YUANMENG-1
Author

YUANMENG-1 commented Mar 6, 2024


Sorry, I've run into another problem:

When running the demo data:

python3 -m blink.blink_cli ./example/accuracy_test_data/small.mgf ./example/accuracy_test_data/medium.mgf ./blink2out.csv ./models/positive_random_forest.pickle ./models/negative_random_forest.pickle positive --min_predict 0.01 --mass_diffs 0 14.0157 12.000 15.9949 2.01565 27.9949 26.0157 18.0106 30.0106 42.0106 1.9792 17.00284 24.000 13.97925 1.00794 40.0313

The run output warns that loading a model trained with scikit-learn 1.0.2 under version 1.4.1.post1 might lead to unexpected behavior or invalid results.

Downgrading with "conda install scikit-learn=1.0.2" solves it. The original failure looked like this:

INFO:root:Processing small.mgf
INFO:root:Processing medium.mgf
/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator DecisionTreeRegressor from version 1.0.2 when using version 1.4.1.post1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
Traceback (most recent call last):
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/public/home/yuanmy/blink/blink/blink_cli.py", line 5, in
main()
File "/public/home/yuanmy/blink/blink/blink.py", line 279, in main
regressor = pickle.load(out)
File "sklearn/tree/_tree.pyx", line 865, in sklearn.tree._tree.Tree.setstate
File "sklearn/tree/_tree.pyx", line 1571, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:

expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
got: [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]
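
A quick generic way to check which scikit-learn version an environment is actually using (a general check, not from this thread):

python -c "import sklearn; print(sklearn.__version__)"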

@YUANMENG-1
Author


Sorry again, this time a new problem occurred, which seems to be "cannot find the charge column":

python3 -m blink.blink_cli ./example/accuracy_test_data/small.mgf ./example/accuracy_test_data/medium.mgf ./blink2out.csv ./models/positive_random_forest.pickle ./models/negative_random_forest.pickle positive --min_predict 0.01 --mass_diffs 0 14.0157 12.000 15.9949 2.01565 27.9949 26.0157 18.0106 30.0106 42.0106 1.9792 17.00284 24.000 13.97925 1.00794 40.0313

INFO:root:Processing small.mgf
INFO:root:Processing medium.mgf
INFO:root:Input files read time: 2.8513012938201427 seconds, 7010 spectra
INFO:root:Discretization time: 2.101361535489559 seconds, 7010 spectra
INFO:root:Scoring time: 16.316090885549784 seconds, 6010000 comparisons
INFO:root:Prediction time: 22.110017083585262 seconds, 6010000 comparisons
Traceback (most recent call last):
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
return self._engine.get_loc(casted_key)
File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'charge'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/public/home/yuanmy/blink/blink/blink_cli.py", line 5, in
main()
File "/public/home/yuanmy/blink/blink/blink.py", line 334, in main
output = pd.merge(output, query_df['charge'], left_on='query', right_index=True)
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/pandas/core/frame.py", line 4090, in getitem
indexer = self.columns.get_loc(key)
File "/public/home/yuanmy/Data/miniconda3/envs/spectral_lib_matcher/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
raise KeyError(key) from err
KeyError: 'charge'

@tharwood3
Collaborator

Hi @YUANMENG-1. Sorry you're having issues with the command-line implementation. The CLI is still under active development and has some experimental features that are currently outside the scope of what was published in the BLINK paper. For standard use, we recommend following the examples in the tutorial notebook.

@YUANMENG-1
Author


Sorry to bother you again. I tried to compare two MGF files by adapting your mzML-vs-MGF tutorial at
https://github.com/biorack/blink/blob/main/tutorial/blink_tutorial.ipynb
but the results don't look right (results file uploaded):

output_test.csv

The problems in the output file are as follows:

  1. The query and ref columns (columns 3 and 4) all contain titles from the query file, while columns 6 and 10 appear to contain the titles of query and ref. Which is correct? How can I tell whether the two files are being compared against each other, or the query against itself?
  2. spectrum_query and spectrum_ref do not look right, as they only seem to contain intensity information (I understand spectrum_ref should contain both m/z and intensity?).
  3. Could the above be because I used your protocol incorrectly? Since the upstream step usually produces a merged .mgf file, I wanted to compare the two MGFs by adapting your tutorial as follows:

import blink
import pandas as pd

# Read both MGF files into dataframes
mgf_query = blink.open_msms_file('../SpectralEntropy-master/neg_slaw_modi7.mgf')
mgf_ref = blink.open_msms_file('../SpectralEntropy-master/all_gnps_neg.mgf')

# Discretize both sets of spectra (ref passed first, query second)
discretized_spectra = blink.discretize_spectra(
    mgf_ref.spectrum.tolist(), mgf_query.spectrum.tolist(),
    mgf_ref.precursor_mz.tolist(), mgf_query.precursor_mz.tolist(),
    bin_width=0.001, tolerance=0.01, intensity_power=0.5,
    trim_empty=False, remove_duplicates=False, network_score=False)

# %%time  (Jupyter cell magic in the original notebook)
# Score all pairs of spectra and inspect the score/count matrices
S12 = blink.score_sparse_spectra(discretized_spectra)
S12['mzi']
S12['mzc']

# Filter hits and build the output dataframe
filtered_S12 = blink.filter_hits(S12, min_matches=5, override_matches=20, min_score=0.6)
m = blink.reformat_score_matrix(filtered_S12)
df = blink.make_output_df(m)
df = df.sparse.to_dense()

# Attach metadata from each input file
df = pd.merge(df, mgf_ref.add_suffix("_ref"), left_on="ref", right_index=True)
df = pd.merge(df, mgf_query.add_suffix("_query"), left_on="query", right_index=True)

df.to_csv('output_test.csv', index=False)

@tharwood3
Collaborator

No problem, I'm happy to help. Your code looks good; it seems the issue is in merging the metadata. I think what you need to do is switch the order of mgf_query and mgf_ref in blink.discretize_spectra() and try again. The "query" column in the output dataframe always corresponds to indices of the first set of spectra, while the "ref" column holds the indices of the second set. This is easy to mix up, and I do need to improve the documentation on how this works. I recently fixed the same issue in the tutorial notebook itself, so check out the most recent version if you have an older one.
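
Concretely, the suggested swap would look like this (a sketch reusing the variable names from your snippet):

discretized_spectra = blink.discretize_spectra(
    mgf_query.spectrum.tolist(), mgf_ref.spectrum.tolist(),
    mgf_query.precursor_mz.tolist(), mgf_ref.precursor_mz.tolist(),
    bin_width=0.001, tolerance=0.01, intensity_power=0.5,
    trim_empty=False, remove_duplicates=False, network_score=False)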

As far as your other question about the spectra, those look okay to me. It is less obvious when the arrays are converted to strings in a saved CSV, but each spectrum is modeled as an array of two arrays: the first holds the m/z values and the second their intensities. For instance, the first "spectrum_ref" entry is the following:

[[54.1031, 66.751411, 68.216377, 116.144203, 123.719292, 136.225159, 149.481293, 149.565704, 150.117798, 280.409973], [1980., 2953., 2763., 2169., 2030., 2355., 2194., 2409., 2400., 2254.]]

The first list holds the m/z values, and the second the intensities. Hopefully this helps!
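
To pull one apart programmatically, a minimal sketch (assuming df is the output dataframe from your snippet, before saving to CSV):

spectrum = df['spectrum_ref'].iloc[0]
mzs, intensities = spectrum[0], spectrum[1]  # first array: m/z values, second: intensities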

@YUANMENG-1
Author

I followed your suggestion to "switch the order of mgf_query and mgf_ref in blink.discretize_spectra() and try it again".

It now appears to get stuck producing larger tables, and the output file is much larger than before the blink.discretize_spectra() order was swapped.

Does the blink.discretize_spectra() order determine the side lengths of the computed sparse matrix, so that whichever file has more spectra (database MGF or query MGF) should be placed first? And is there a way to filter the output file so it does not include the m/z and intensity arrays of query and ref?

[screenshot: run appearing stuck while writing the larger output]

@tharwood3
Copy link
Collaborator

My guess is that your output is so much larger because the metadata is now being associated correctly, though there could be something else going on. This appears to be a pretty big comparison, so the pd.merge adds a lot of extra content to the output dataframe. The order of the spectra in the discretize_spectra function shouldn't change the size of the score matrix. The algorithm is more efficient when the smaller set of spectra comes first, but it shouldn't make a huge difference (the query set is typically smaller than the ref set).

This is more of a pandas question than a BLINK question, but I can give you some suggestions. If you want to decrease the size of your outputs, you can filter the output dataframe by score or number of matches before adding metadata. If you have already filtered those, you can choose to associate only essential metadata in the merge instead of everything read from the MGF files. For instance:

df = pd.merge(df, mgf_ref[['pepmass', 'title']].add_suffix("_ref"), left_on="ref", right_index=True)

If I/O and file size are a concern, maybe look into using Parquet files or similar rather than CSV. Good luck!
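
Putting those together, a rough sketch (the 'score' and 'matches' column names are assumptions based on filter_hits' parameters, so check them against your actual columns):

df = df[(df['score'] >= 0.6) & (df['matches'] >= 5)]  # filter hits before merging metadata
df = pd.merge(df, mgf_ref[['pepmass', 'title']].add_suffix('_ref'), left_on='ref', right_index=True)
df = pd.merge(df, mgf_query[['pepmass', 'title']].add_suffix('_query'), left_on='query', right_index=True)
df.to_parquet('output_test.parquet', index=False)  # requires pyarrow or fastparquet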
