Skip to content

Is aho-corasick a good option for short patterns (2 bytes), in short text (< 30 bytes)? #142

Answered by BurntSushi
masklinn asked this question in Q&A
Discussion options

You must be logged in to vote

Interestingly, I do not. I think your instincts about match my own, or are at least what I would try first. A 10 byte haystack is rather short, for example, and it is too short for even the SSE2 implementations of memchr. In the 10 byte haystack case, it's likely that the SWAR approach will be used. And if you get down below 8 bytes (assuming a 64-bit target), then it will just be a byte-at-a-time loop.

Another choice, especially if your needles are two bytes, is to try using the lower level packed substring routines directly. It's very data dependent, but for example, if most of your memchr searches produce a false positive, where as a two byte needle via Teddy (a vectorized packed subst…

Replies: 1 comment 2 replies

Comment options

You must be logged in to vote
2 replies
@masklinn
Comment options

@BurntSushi
Comment options

Answer selected by masklinn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Category
Q&A
Labels
None yet
2 participants