Library usage questions... #104

tflorac · 2018-12-19T08:58:31Z

Hi,

I'm just starting to discover and use your library to extract terms defined in a thesaurus from an input text, and "highlight" them in a HTML output.
It works quite well and quickly, anyway I still have a few issues:

is there any way to get only the first occurence of a given term when this term is present several times in the input text?
I have some use cases where a term can "contain" another one (for example, "Python", "IDE Python" and "Python language reference"). Actually, in this example, the automaton returns the term "Python" event when it is included into another matching term. Is there any way to select only one of them (for example the first matching term, or the longest one)?

Many thanks for any advise!

Best regards,
Thierry

WojciechMula · 2018-12-19T16:57:21Z

Hi Thierry, glad to hear you found pyahocorasick module useful.

There's no way to limit number of hits, you have to do this in your python code. But it seems to be an interesting feature. Would you open a feature request for this?

Automaton will always return all occurrences of strings from the set, even when they overlap. I think the approach from #21 is what you need, but it's not ready. Seems I'll need to back to this, you're another person who requests this.

tflorac · 2018-12-19T20:14:45Z

Thanks for your reply!
I will open another ticket for a feature request!!
And if you need help for testing, just ask! ;)

WojciechMula · 2018-12-19T20:38:38Z

@tflorac thank you, I always need help, as my day usually has 24h only :)

tflorac · 2018-12-20T10:01:16Z

I'll complete my initial question with another one...
I found a solution to only get the first found occurrence of a string, and I'm now looking for a solution to handle my second problem: how to handle use case where a term overlaps another one!
It seems that when overlap occurs at the end of the longest string (for example, with the terms "Python" and "Advanced Python"), the longest one (here "Advanced Python") appears first, but with the same position as the second one ; so I think that I can easilly handle this case by remembering the last position at which I already found a term. Can you confirm this point?
The problem persists when two terms overlap at the start ("Python" and "Python Book" for example); is there any rule in this case which defines the order in which found terms will be returned?

WojciechMula · 2018-12-20T19:45:55Z

@tflorac I'm afraid there is no such rule. The automaton work this way that if it matches the last character of some word, it also can figure out other words with the same suffix. But there's no notion of "the first character of match".

BTW, what should be highlighted if you have set ["Advanced Python", "Python Book"] and the string is "Advanced Python Book"? There is partial overlap.

tflorac · 2018-12-21T07:23:11Z

@WojciechMula: well... so at first I'm going to follow a simple process where if two terms (or more) match at the same position, the first one will win and will be selected!

I don't have any rule actually to answer to your question in the use case you give. I can't define any priority between the three terms, except by giving a "weight" to each term (statically or, for example, based on their respective length)... :-/

kootenpv · 2018-12-26T20:17:10Z

I also came here to request exactly this... I think the functionality "keep the longest match" will be very well received :)

@WojciechMula Oh and thanks for being so quick on responding to people's requests, I'm having a great time. I'm building a library on top of this: I pretty much finished it but only now realized the overlap "issue" :(

In my case I want to add 2 strings and count on the fact that only the longest one will be kept.

kootenpv · 2018-12-26T21:35:03Z

I think this works as python logic, but it would be awesome if we could get the speed from c as I think this will be really slow:

from ahocorasick import Automaton

a = Automaton()

a.add_word("no", "no")
a.add_word("no thanks", "no thanks")
a.add_word("thanks", "thanks")

a.make_automaton()

ranges = [(x[0] - len(x[1]) + 1, x[0] + 1, x[1]) for x in a.iter("no thanks")]

# [(0, 2, 'no'), (0, 9, 'no thanks'), (3, 9, 'thanks')]

def remove_overlap(ranges):
    result = []
    current_stop = -1 
    for start, stop, value in ranges:
        if start > current_stop:
            # this segment starts after the last segment stops
            # just add a new segment
            result.append((stop - start, value))
            current_stop = stop
        elif stop - start > result[-1][0]:
            # segments overlap, replace
            result[-1] = (stop - start, value)
            # current_start already guaranteed to be lower
            current_stop = max(current_stop, stop)

    return [x[1] for x in result]

remove_overlap(ranges)
['no thanks']

In my case it seems that incorporating this code yields a ~15% slowdown.

WojciechMula · 2019-01-06T13:54:51Z

@kootenpv Thank you for this example! I think this request depends on #82.

WojciechMula added the question label Dec 19, 2018

WojciechMula added enhancement and removed question labels Jan 6, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Library usage questions... #104

Library usage questions... #104

tflorac commented Dec 19, 2018

WojciechMula commented Dec 19, 2018

tflorac commented Dec 19, 2018

WojciechMula commented Dec 19, 2018

tflorac commented Dec 20, 2018

WojciechMula commented Dec 20, 2018

tflorac commented Dec 21, 2018

kootenpv commented Dec 26, 2018 •

edited

kootenpv commented Dec 26, 2018 •

edited

WojciechMula commented Jan 6, 2019

Library usage questions... #104

Library usage questions... #104

Comments

tflorac commented Dec 19, 2018

WojciechMula commented Dec 19, 2018

tflorac commented Dec 19, 2018

WojciechMula commented Dec 19, 2018

tflorac commented Dec 20, 2018

WojciechMula commented Dec 20, 2018

tflorac commented Dec 21, 2018

kootenpv commented Dec 26, 2018 • edited

kootenpv commented Dec 26, 2018 • edited

WojciechMula commented Jan 6, 2019

kootenpv commented Dec 26, 2018 •

edited

kootenpv commented Dec 26, 2018 •

edited