Document redlist #141

Open
eklem opened this issue Mar 6, 2022 · 0 comments
eklem commented Mar 6, 2022

  • The stopword module removes stopwords; in other words, it blacklists words.
  • A redlist is the opposite: a list of words that you don't want to "go extinct". It is not generic but tied to a specific text corpus. It is maintained manually and applied automatically.
  • When using stopword-trainer, you have a source of text. If this text corpus is not static but growing, you can retrain the stopword list whenever new text is added to the corpus.
  • For every source of text there are some words that you wouldn't consider stopwords but that may end up defined as such. To keep such a word permanently out of the blacklist when retraining, add it to the redlist.
  • The combination of the raw stopword data (all words in the corpus plus every word's stopwordiness), the redlist, and the cutOff number (how many words to use as stopwords) produces a stopword list for a given text corpus.
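
To make the last point concrete, here is a minimal sketch of how the three inputs could combine. It is not the stopword-trainer API: the function name `buildStopwordList`, the `{ word, stopwordiness }` data shape, and the sample scores are all hypothetical. It also assumes that a higher stopwordiness means a stronger stopword candidate, and that the redlist is applied before the cutOff so the result still holds cutOff words.

```javascript
// Hypothetical sketch, not the stopword-trainer API.
// rawData: array of { word, stopwordiness } (assumed shape)
// redlist: words that must never become stopwords
// cutOff:  how many words to use as stopwords
function buildStopwordList(rawData, redlist, cutOff) {
  const protectedWords = new Set(redlist.map((w) => w.toLowerCase()));
  return rawData
    // Drop redlisted words first (assumption: redlist applies before cutOff).
    .filter((entry) => !protectedWords.has(entry.word.toLowerCase()))
    // Highest stopwordiness first (assumption: higher score = more stopword-like).
    .sort((a, b) => b.stopwordiness - a.stopwordiness)
    .slice(0, cutOff)
    .map((entry) => entry.word);
}

// Made-up sample data for a corpus where "whale" is frequent but meaningful.
const rawData = [
  { word: 'the', stopwordiness: 0.98 },
  { word: 'and', stopwordiness: 0.95 },
  { word: 'whale', stopwordiness: 0.9 },
  { word: 'of', stopwordiness: 0.88 },
  { word: 'harpoon', stopwordiness: 0.12 },
];
const redlist = ['whale'];

const stopwords = buildStopwordList(rawData, redlist, 3);
console.log(stopwords); // [ 'the', 'and', 'of' ]
```

Without the redlist, "whale" would have taken the third slot. Because the redlist lives outside the raw data, retraining on a grown corpus can change the scores without ever pulling the word back into the blacklist.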
@eklem eklem self-assigned this Mar 6, 2022
@eklem eklem added this to To do in Browser-ready Mar 6, 2022