How to remove matchings that could not align word boundary? #170

Open
zwd2080 opened this issue Jun 22, 2022 · 3 comments

zwd2080 commented Jun 22, 2022

The second match (5, 'her') and the last one (14, 'she') do not align with word boundaries. How can I remove them?
Or could we force the automaton to match whole words only?

import ahocorasick

A = ahocorasick.Automaton()
for key in 'he her hers she'.split():
    A.add_word(key, key)  # store the key itself as the value
A.make_automaton()
needle = "he here her shes"
list(A.iter_long(needle))
# [(1, 'he'), (5, 'her'), (10, 'her'), (14, 'she')]

@pombredanne (Collaborator)

Are you saying that you only want whole words matched? If so, do not add character strings as keys. Instead, add sequences of words converted to numbers. Otherwise the automaton operates on characters and matches characters; it knows nothing about words.
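A minimal sketch of that suggestion, assuming pyahocorasick's `KEY_SEQUENCE` key type (keys are tuples of integers). The names `word_ids` and `encode` are hypothetical helpers, not part of the library, and the import is guarded only so that the encoding part runs where pyahocorasick is not installed.

```python
try:
    import ahocorasick  # pyahocorasick; optional so the encoding part runs without it
except ImportError:
    ahocorasick = None

word_ids = {}  # hypothetical shared vocabulary: word -> small positive integer


def encode(text):
    """Turn whitespace-separated words into a tuple of integer ids."""
    return tuple(word_ids.setdefault(w, len(word_ids) + 1) for w in text.split())


keys = ["he", "her", "hers", "she"]
encoded_keys = [encode(k) for k in keys]   # one single-word tuple per key
haystack = encode("he here her shes")      # same vocabulary, new words get new ids

if ahocorasick is not None:
    A = ahocorasick.Automaton(ahocorasick.STORE_ANY, ahocorasick.KEY_SEQUENCE)
    for key, seq in zip(keys, encoded_keys):
        A.add_word(seq, key)
    A.make_automaton()
    # end_index is a position in the word sequence, not a character offset
    for end_index, value in A.iter(haystack):
        print(end_index, value)
```

Because "here" and "shes" receive their own ids, they can no longer partially match "her" or "she"; the automaton only ever compares whole words.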

@donatoaz

Hi @pombredanne, just to make sure I understand: the idea is that each unique word in the needles would map to a distinct int, and we'd add these ints as keys and the words as the values?

Do you have a recommendation for this mapping? The haystack will also need to be mapped with the same map before iterating over it.

Thanks!
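One possible mapping, offered as a sketch rather than a recommendation from the maintainers: a single dictionary that hands out the next free integer to every unseen word, used for both needles and haystack. `Vocabulary`, `tokenize`, and `WORD_RE` are hypothetical names, and treating a word as a run of `\w` characters (lowercased) is an assumption. Keeping the character span of each token makes it possible to translate a word-level match position back into offsets in the original text.

```python
import re

WORD_RE = re.compile(r"\w+")  # assumption: a word is a run of letters, digits or underscores


def tokenize(text):
    """Return (word, start, end) triples; end is exclusive."""
    return [(m.group().lower(), m.start(), m.end()) for m in WORD_RE.finditer(text)]


class Vocabulary:
    """Hypothetical helper: assigns each distinct word a stable integer id."""

    def __init__(self):
        self.ids = {}

    def encode(self, words):
        # Unseen words get the next free id, so needles and haystack share one map.
        return tuple(self.ids.setdefault(w, len(self.ids) + 1) for w in words)


vocab = Vocabulary()
needle_ids = [vocab.encode([w]) for w in "he her hers she".split()]

text = "He here, her shes"
tokens = tokenize(text)
haystack_ids = vocab.encode(w for w, _, _ in tokens)
# token index -> character span in the original text
spans = [(start, end) for _, start, end in tokens]
```

A match ending at token index `i` then corresponds to the characters `text[spans[i][0]:spans[i][1]]`.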

explrA commented Oct 2, 2023

@pombredanne

Can we get more info on this, please?
I want exact (whole) word matches and I don't understand how to approach it.
Any insights would be greatly appreciated.

Thanks
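A different approach from the word-to-integer encoding suggested earlier in the thread is to keep the character automaton and post-filter its results. The sketch below assumes each stored value is the key itself (as in the original post, `A.add_word(key, key)`), so the match start can be derived from the key length. `whole_word_matches` and `is_word_char` are hypothetical helpers, not pyahocorasick functions.

```python
def is_word_char(ch):
    """Assumption: letters, digits and underscore count as word characters."""
    return ch.isalnum() or ch == "_"


def whole_word_matches(haystack, matches):
    """Keep (end_index, key) pairs whose span is not glued to a neighbouring word character."""
    kept = []
    for end, key in matches:
        start = end - len(key) + 1
        left_ok = start == 0 or not is_word_char(haystack[start - 1])
        right_ok = end + 1 == len(haystack) or not is_word_char(haystack[end + 1])
        if left_ok and right_ok:
            kept.append((end, key))
    return kept


haystack = "he here her shes"
# Matches as reported in the original post for A.iter_long(haystack).
reported = [(1, "he"), (5, "her"), (10, "her"), (14, "she")]
print(whole_word_matches(haystack, reported))  # [(1, 'he'), (10, 'her')]
```

One caveat: `iter_long` reports only the longest match at each position, so a shorter key that is a whole word could be hidden behind a longer match that gets filtered out. Filtering the output of `A.iter(haystack)` instead is likely the safer choice when keys can be prefixes of one another.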
