Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Library usage questions... #104

Open
tflorac opened this issue Dec 19, 2018 · 9 comments
Open

Library usage questions... #104

tflorac opened this issue Dec 19, 2018 · 9 comments

Comments

@tflorac
Copy link

tflorac commented Dec 19, 2018

Hi,

I'm just starting to discover and use your library to extract terms defined in a thesaurus from an input text, and "highlight" them in a HTML output.
It works quite well and quickly, anyway I still have a few issues:

  • is there any way to get only the first occurence of a given term when this term is present several times in the input text?
  • I have some use cases where a term can "contain" another one (for example, "Python", "IDE Python" and "Python language reference"). Actually, in this example, the automaton returns the term "Python" event when it is included into another matching term. Is there any way to select only one of them (for example the first matching term, or the longest one)?

Many thanks for any advise!

Best regards,
Thierry

@WojciechMula
Copy link
Owner

Hi Thierry, glad to hear you found pyahocorasick module useful.

There's no way to limit number of hits, you have to do this in your python code. But it seems to be an interesting feature. Would you open a feature request for this?

Automaton will always return all occurrences of strings from the set, even when they overlap. I think the approach from #21 is what you need, but it's not ready. Seems I'll need to back to this, you're another person who requests this.

@tflorac
Copy link
Author

tflorac commented Dec 19, 2018

Thanks for your reply!
I will open another ticket for a feature request!!
And if you need help for testing, just ask! ;)

@WojciechMula
Copy link
Owner

@tflorac thank you, I always need help, as my day usually has 24h only :)

@tflorac
Copy link
Author

tflorac commented Dec 20, 2018

I'll complete my initial question with another one...
I found a solution to only get the first found occurrence of a string, and I'm now looking for a solution to handle my second problem: how to handle use case where a term overlaps another one!
It seems that when overlap occurs at the end of the longest string (for example, with the terms "Python" and "Advanced Python"), the longest one (here "Advanced Python") appears first, but with the same position as the second one ; so I think that I can easilly handle this case by remembering the last position at which I already found a term. Can you confirm this point?
The problem persists when two terms overlap at the start ("Python" and "Python Book" for example); is there any rule in this case which defines the order in which found terms will be returned?

@WojciechMula
Copy link
Owner

@tflorac I'm afraid there is no such rule. The automaton work this way that if it matches the last character of some word, it also can figure out other words with the same suffix. But there's no notion of "the first character of match".

BTW, what should be highlighted if you have set ["Advanced Python", "Python Book"] and the string is "Advanced Python Book"? There is partial overlap.

@tflorac
Copy link
Author

tflorac commented Dec 21, 2018

@WojciechMula: well... so at first I'm going to follow a simple process where if two terms (or more) match at the same position, the first one will win and will be selected!

I don't have any rule actually to answer to your question in the use case you give. I can't define any priority between the three terms, except by giving a "weight" to each term (statically or, for example, based on their respective length)... :-/

@kootenpv
Copy link

kootenpv commented Dec 26, 2018

I also came here to request exactly this... I think the functionality "keep the longest match" will be very well received :)

@WojciechMula Oh and thanks for being so quick on responding to people's requests, I'm having a great time. I'm building a library on top of this: I pretty much finished it but only now realized the overlap "issue" :(

In my case I want to add 2 strings and count on the fact that only the longest one will be kept.

@kootenpv
Copy link

kootenpv commented Dec 26, 2018

I think this works as python logic, but it would be awesome if we could get the speed from c as I think this will be really slow:

from ahocorasick import Automaton

a = Automaton()

a.add_word("no", "no")
a.add_word("no thanks", "no thanks")
a.add_word("thanks", "thanks")

a.make_automaton()

ranges = [(x[0] - len(x[1]) + 1, x[0] + 1, x[1]) for x in a.iter("no thanks")]

# [(0, 2, 'no'), (0, 9, 'no thanks'), (3, 9, 'thanks')]

def remove_overlap(ranges):
    result = []
    current_stop = -1 
    for start, stop, value in ranges:
        if start > current_stop:
            # this segment starts after the last segment stops
            # just add a new segment
            result.append((stop - start, value))
            current_stop = stop
        elif stop - start > result[-1][0]:
            # segments overlap, replace
            result[-1] = (stop - start, value)
            # current_start already guaranteed to be lower
            current_stop = max(current_stop, stop)

    return [x[1] for x in result]

remove_overlap(ranges)
['no thanks']

In my case it seems that incorporating this code yields a ~15% slowdown.

@WojciechMula
Copy link
Owner

@kootenpv Thank you for this example! I think this request depends on #82.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants