Overwhelming number of synonyms #13

Closed
Lampent opened this issue Aug 8, 2022 · 6 comments

@Lampent
Lampent commented Aug 8, 2022

Hello,

Thank you for publishing this package. It is a highly beneficial resource.

When searching for synonyms, I noticed unexpected behavior that may be a bug.
For the word "good", the function find_synonyms() returns a list of 104 unique words. Some of them do not appear to be synonyms of "good", for example "bully", "cracking", "bad", "boss", "hard", and "spanking", plus a couple of other words that I am not sure about. The same behavior occurs with other words as well.

I am unsure if there is a specific website that enriches the synonyms with such words or if it is a bug in the crawling process. A possible solution may be to allow the selection of the websites on which the crawling process takes place.

I would highly recommend this option since I am unsure about the legitimacy of the other sources except for "merriam-webster" and "wordnet".

For now, I have decided to take the synonyms directly from "wordnet", as I cannot guarantee that the words from the other sources are actually synonyms.
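
For illustration, source selection could look something like the minimal sketch below. It assumes the crawl results are available as a dictionary keyed by source name. The function `filter_by_source`, the `results_by_source` structure, and the sample words are hypothetical and are not part of the package's API.

```python
from typing import Dict, Iterable, List


def filter_by_source(results_by_source: Dict[str, List[str]],
                     allowed_sources: Iterable[str]) -> List[str]:
    """Merge synonyms from the allowed sources only, keeping first-seen order."""
    allowed = {source.lower() for source in allowed_sources}
    merged: List[str] = []
    seen = set()
    for source, words in results_by_source.items():
        if source.lower() not in allowed:
            continue  # skip sources the caller did not opt into
        for word in words:
            key = word.lower()
            if key not in seen:  # de-duplicate across sources
                seen.add(key)
                merged.append(word)
    return merged


# Hypothetical sample data, not actual crawl output.
sample = {
    "merriam-webster": ["fine", "pleasant"],
    "synonym.com": ["bully", "spanking"],
    "wordnet": ["fine", "beneficial"],
}

trusted = filter_by_source(sample, ["merriam-webster", "wordnet"])
print(trusted)
```

With this shape, the default could still be "all sources", so users who never pass a source list would see no change in behavior.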

@johnbumgarner johnbumgarner self-assigned this Aug 9, 2022
@johnbumgarner johnbumgarner added the enhancement New feature or request label Aug 9, 2022
@johnbumgarner
Owner

I originally had a way to filter by source, but I removed it from the code, because it caused confusion for some of my beta users.

Synonyms are enriched from 4 sources. All these sources have been validated in testing.

Concerning the word "bully": it is a synonym of "good" and is being pulled from synonym.com. Here is a reference from the Oxford Languages dictionary:

[Screenshot: Oxford Languages dictionary entry for "bully"]

Here is a reference from the Oxford Languages dictionary for the word "spanking," which is also a synonym of "good":

[Screenshot: Oxford Languages dictionary entry for "spanking"]

@johnbumgarner johnbumgarner pinned this issue Aug 9, 2022
@Lampent
Author

Lampent commented Aug 9, 2022

Thank you for the detailed and quick response,

I am not sure why filtering by source was confusing. From my experience, allowing developers a wide range of options and possibilities, if implemented correctly, should not negatively affect the developer's experience.

With such a wide range of synonyms, having no sense of control over them is a double-edged sword. I would suggest developing a solution that indicates whether a word is marked as informal/slang by its source. In my opinion, filtering slang/informal cases is a common practice.

@johnbumgarner
Owner

Do you know the level of difficulty of developing a solution that tries to classify words as formal or informal/slang? I'm not sure this is even possible without creating a backend data source that contains these relationships. Do you have any suggestions on how to develop this solution for the English language?

@Lampent
Author

Lampent commented Aug 10, 2022

The data can be found on the source websites themselves. As part of the crawling, you can collect the usage tags attached to each word on each website (both "bully" and "spanking" have an informal tag). I implemented this method while crawling Wiktionary, filtering out slang/informal definitions of idioms.
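
As a rough sketch of that approach, assume the crawler stores each word together with whatever usage tags it found on the source page. The `TaggedSynonym` class, the tag names, the `example.org` source, and the sample entries are all hypothetical. This is not code from WordHoard or from any existing crawler.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

# Tag names treated as informal; an assumption, real sites may label differently.
INFORMAL_TAGS = frozenset({"informal", "slang"})


@dataclass(frozen=True)
class TaggedSynonym:
    word: str
    source: str
    tags: FrozenSet[str] = field(default_factory=frozenset)


def drop_informal(entries: List[TaggedSynonym]) -> List[TaggedSynonym]:
    """Keep only entries that carry none of the informal/slang tags."""
    return [
        entry for entry in entries
        if not ({tag.lower() for tag in entry.tags} & INFORMAL_TAGS)
    ]


# Hypothetical crawl output for the word "good".
crawled = [
    TaggedSynonym("bully", "example.org", frozenset({"informal"})),
    TaggedSynonym("spanking", "example.org", frozenset({"Informal", "dated"})),
    TaggedSynonym("beneficial", "example.org"),
]

formal_only = [entry.word for entry in drop_informal(crawled)]
print(formal_only)
```

Entries without any tag are kept, so a source that publishes no usage labels would pass through unfiltered.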

@johnbumgarner
Owner

I'm sorry, but your statement is not correct, because WordHoard does not pull from the Oxford Languages dictionary. I checked the sources that are queried by WordHoard and none of them provide this data. I could add a module to query the Oxford dictionary, but this source requires a subscription to use. Querying Google search for this information would also lead to captchas being thrown.

@johnbumgarner
Owner

@Lampent I recently spent some time redesigning WordHoard to allow searching by individual sources. Please let me know if this redesign works better for you.
