why the keyword phrase include a PRON, like "it" #271

Open
chencjiajy opened this issue Jan 4, 2024 · 2 comments
@chencjiajy

I ran the following code snippet, and the output includes the word "it", even though pos_kept does not include PRON.

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank", config={'pos_kept': ["NOUN", "PROPN", "VERB"]})

text = '''The MCU SDK for WRG1 general firmware has been launched, and it can be automatically generated after creating the product.'''
doc = nlp(text)

for phrase in doc._.phrases[:10]:
    print(phrase.text, phrase.rank, phrase.count, phrase.chunks)

## the output is 
# the product 0.12286712485174818 1 [the product]
# WRG1 general firmware 0.10712303413227088 1 [WRG1 general firmware]
# The MCU SDK 0.0834726982382997 1 [The MCU SDK]
# it 0.0 1 [it]
@ceteri
Collaborator

ceteri commented Jan 4, 2024

Hi @chencjiajy, great question.

The library considers noun chunks, and apparently spaCy parses the term "it" as one.

The coreference capabilities for spaCy are currently marked "experimental", which is a nice way to say "Good luck installing and running this part in production" :) I've evaluated multiple options for coreference (including the AllenNLP integration) and they each seem to have serious limitations. That said, if these capabilities were available, it would be relatively simple to resolve a pronoun reference within the graph. In that case, the term "it" would add more weight to "The MCU SDK" instead.

If you want, it might be worth adding the term "it" to the stop words list for your application.

@chencjiajy
Author

Hi @ceteri, I found that it isn't necessary to add the term "it" to the stop words list, and the same goes for other single PRON words. Because pos_kept does not include PRON, I don't need to add single PRON words to the stop words. In the function _collect_phrases in base.py, pytextrank skips any token whose POS is not in pos_kept when it sums the ranks, so a phrase made of a single PRON word always has a rank of 0.0. So all I need to do is filter out the phrases whose rank is equal to zero.

        phrases: typing.Dict[Span, float] = {
            span: sum(
                ranks[Lemma(token.lemma_, token.pos_)]
                for token in span
                if self._keep_token(token)
            )
            for span in spans
        }
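
For anyone who hits the same thing, here is a minimal sketch of that filter. The helper name `nonzero_phrases` and the `eps` threshold are my own, not part of pytextrank, and the `SimpleNamespace` objects only stand in for the entries of `doc._.phrases` (using the ranks from the output above) so that the snippet runs without loading a spaCy model.

```python
from types import SimpleNamespace


def nonzero_phrases(phrases, eps=0.0):
    """Keep only the phrases whose rank is strictly greater than eps.

    Hypothetical helper, not a pytextrank API. It only assumes that
    each phrase has a numeric `.rank` attribute.
    """
    return [p for p in phrases if p.rank > eps]


# Stand-ins for doc._.phrases, copied from the output reported above.
sample = [
    SimpleNamespace(text="the product", rank=0.12286712485174818),
    SimpleNamespace(text="WRG1 general firmware", rank=0.10712303413227088),
    SimpleNamespace(text="The MCU SDK", rank=0.0834726982382997),
    SimpleNamespace(text="it", rank=0.0),
]

# A phrase with no kept tokens sums over nothing, so its rank is 0.
assert sum(r for r in []) == 0

kept = nonzero_phrases(sample)
for phrase in kept:
    print(phrase.text, phrase.rank)
```

With the real pipeline this would presumably be called as `nonzero_phrases(doc._.phrases)`. Note that it drops every zero-rank phrase, not only pronouns, which is what I want in my case.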

@chencjiajy chencjiajy reopened this Jan 8, 2024