Use Peter Norvig spell corrector for similar_names #300

Open: wangkuiyi wants to merge 1 commit into main

@wangkuiyi wangkuiyi commented Feb 1, 2024

Guoli Yin pointed me to this code snippet for field name correction. I instantly thought of Peter Norvig's spelling corrector, which is described at https://norvig.com/spell-correct.html. This pull request takes advantage of Norvig's technique for name suggestions.

I'm preserving the original function name, 'similar_names'. However, the new algorithm suggests only one name. I think this makes more sense than suggesting multiple choices, because a class's set of properties is typically far smaller than the English vocabulary, and users may prefer a single accurate correction when they misspell a name.

Please let me know if I have misinterpreted the problem here. Thanks!

@gyin94 gyin94 requested a review from markblee February 1, 2024 19:15
"""Use Peter Norvig's spell corrector at https://norvig.com/spell-correct.html"""
word_count = Counter([_ for _ in candidates])

def P(word, N=sum(word_count.values())):
Contributor

consider a type hint and return type?

Author

I agree that we should add the type hints if we merge this pull request; however, after more thought, I am no longer sure that we should adopt Norvig's algorithm.

The algorithm expands the misspelled word into a small set of similar words and filters out those that are not in the vocabulary. It works this way because an English vocabulary is too large to compute the edit distance between the misspelled word and every word in it.

However, a Python class has far fewer candidate symbols than a vocabulary has words, so it may be affordable to compute the edit distance against every candidate directly. What do you think?
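
To make the expand-then-filter step concrete, here is a minimal sketch modeled on the `edits1` idea from Norvig's essay. It is not the code in this pull request: the helper name `known_candidates`, the set-based `vocabulary` argument, and the inclusion of `_` in the alphabet (so that identifiers such as `num_layers` can be reached) are assumptions made for illustration.

```python
import string


def edits1(word: str, alphabet: str = string.ascii_lowercase + "_") -> set[str]:
    """Return every string exactly one edit away from `word`.

    An edit is a deletion, an adjacent transposition, a replacement,
    or an insertion. The underscore in `alphabet` is an assumption so
    that snake_case field names are covered.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def known_candidates(word: str, vocabulary: set[str]) -> set[str]:
    """Hypothetical helper: keep only the one-edit expansions found in `vocabulary`."""
    return edits1(word) & vocabulary
```

The cost of this step depends on the length of the misspelled word and the alphabet size, not on the size of the vocabulary, which is why it suits a large dictionary.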
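
As a rough sketch of that alternative, the code below computes the Levenshtein distance between the misspelled name and each candidate and returns the nearest one. The names `edit_distance` and `closest_name`, and the `max_distance` cutoff of 2, are hypothetical choices for illustration, not part of this pull request or of the existing `similar_names` function.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of `b`
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # delete ca
                    curr[j - 1] + 1,  # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def closest_name(
    name: str, candidates: Sequence[str], max_distance: int = 2
) -> Optional[str]:
    """Return the candidate nearest to `name`, or None if none is close enough.

    `max_distance` is an assumed cutoff; tune it or drop it as needed.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda c: edit_distance(name, c))
    return best if edit_distance(name, best) <= max_distance else None
```

One difference worth noting: plain Levenshtein distance counts an adjacent transposition as two edits, whereas Norvig's expansion treats it as one, so the two approaches can rank candidates differently.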

Contributor

makes sense to me.
