
Scout: Custom tokenizer indexing properly to allow dashes and periods, but searching on dashes does not work #290

bretvanhorn opened this issue Jun 26, 2023 · 9 comments

@bretvanhorn

Hi, I have created a custom tokenizer that allows dashes, plus signs, and periods in indexed keywords. I've verified that such terms are being indexed correctly:

[Screenshot: wordlist entries showing terms with dashes and periods indexed]

However, when I search, it does not return any of the indexed items that should match:

[Screenshot: search for a term containing a dash returning no matches]

Here is my config:

    'tntsearch' => [
        'storage'  => storage_path(), // place where the index files will be stored
        'fuzziness' => env('TNTSEARCH_FUZZINESS', false),
        'fuzzy' => [
            'prefix_length' => env('TNTSEARCH_FUZZY_LEN', 2),
            'max_expansions' => env('TNTSEARCH_FUZZY_EXPANSIONS', 50),
            'distance' => env('TNTSEARCH_FUZZY_DISTANCE', 2),
            'no_limit' => env('TNTSEARCH_FUZZY_NO_LIMIT', false)
        ],
        'asYouType' => false,
        'searchBoolean' => env('TNTSEARCH_BOOLEAN', false),
        'maxDocs' => env('TNTSEARCH_MAX_DOCS', 500),
        'tokenizer' => \App\Http\Classes\ItemTokenizer::class
    ],

And the .env vars referenced:

SCOUT_DRIVER=tntsearch
SCOUT_QUEUE=false
TNTSEARCH_FUZZINESS=true
TNTSEARCH_BOOLEAN=false
TNTSEARCH_MAX_DOCS=2500
TNTSEARCH_FUZZY_LEN=2
TNTSEARCH_FUZZY_EXPANSIONS=500
TNTSEARCH_FUZZY_DISTANCE=2
TNTSEARCH_FUZZY_NO_LIMIT=false

Here is my tokenizer:

    namespace App\Http\Classes;

    use TeamTNT\TNTSearch\Support\AbstractTokenizer;
    use TeamTNT\TNTSearch\Support\TokenizerInterface;

    class ItemTokenizer extends AbstractTokenizer implements TokenizerInterface
    {
        // Split on runs of anything that is not a letter, number, ".", "+", or "-"
        protected static $pattern = '/[^\p{L}\p{N}\.\+-]+/u';

        public function tokenize($text, $stopwords = [])
        {
            return preg_split($this->getPattern(), strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        }
    }
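
For context, here is what I expect this pattern to produce on a sample string (a quick sketch I can run in tinker; the sample text is made up):

    // Sketch: verify how the pattern above splits a representative string
    $tokenizer = new \App\Http\Classes\ItemTokenizer();
    var_dump($tokenizer->tokenize('Polaroid SX-70 and a 70-200mm f/2.8 +1.4x'));
    // Expected: ["polaroid", "sx-70", "and", "a", "70-200mm", "f", "2.8", "+1.4x"]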

I am not sure whether this issue lies with the Scout driver or the core engine, so please let me know if I should move this to the Scout driver's issue tracker, and whether there is any other info I can provide.

@nticaric
Contributor

Thank you for bringing this to our attention. To help us understand the issue better, could you please try implementing a diagnostic step in your code?

You can use the dd() function within your custom tokenizer during a search operation. That way we'll see whether it hits your custom tokenizer or the default one:

public function tokenize($text, $stopwords = []) {
    $return = preg_split($this->getPattern(), strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    dd($return); // dd() dumps and halts, so no return statement is needed for this test
}
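
Then trigger a search so the dump fires. Assuming a searchable Item model (the model name here is just an example):

    // Any Scout search passes the query string through the tokenizer
    \App\Models\Item::search('70-200')->get();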

@nticaric
Contributor

And while testing this, please set fuzziness to false:

TNTSEARCH_FUZZINESS=false

@bretvanhorn
Author

bretvanhorn commented Jun 26, 2023

@nticaric Thanks for the quick reply! So, I tried this and, interestingly enough, there is no debug output. This search is being done via an API call, but when I go to the API server and view the search request in Debugbar, there is no dd output. I am not sure what I am doing wrong given this revelation, but any suggestions would be welcome!

@nticaric
Contributor

Can you query the info table of the index? It could be that the original index was built with the default tokenizer.
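
Something like this should show what was stored (the index file name below is a placeholder; use whatever Scout created in your storage path):

    // Sketch: inspect the info table of the TNTSearch SQLite index
    $db = new PDO('sqlite:' . storage_path('items.index')); // "items.index" is a placeholder
    foreach ($db->query('SELECT * FROM info') as $row) {
        print_r($row);
    }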

@bretvanhorn
Author

@nticaric, here you go:

[Screenshot: rows of the index's info table]

@bretvanhorn
Author

bretvanhorn commented Jun 26, 2023

Ok, I had forgotten to assign the result to a variable, so the method was returning before dd was called. Now it does break the request, which tells me it is using the custom tokenizer.

If it helps, it almost looks as though it is struggling with numerical content at the beginning of the keyword. For example, sx-70 returns relevant results, but 70-200 returns items whose names contain either 70 or 200 plus a dash elsewhere, with 200 prioritized.

I also remembered that the import indexed a bare "-" as a keyword, and I wonder if that is the issue. I am going to try deleting that keyword from the wordlist and see what that does.
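
If I end up doing it by hand, I figure something like this against the index file (again, "items.index" is a placeholder, and I am not sure whether related doclist rows also need cleanup):

    // Sketch: remove the bare "-" term from the index's wordlist table
    $db = new PDO('sqlite:' . storage_path('items.index')); // placeholder file name
    $db->prepare('DELETE FROM wordlist WHERE term = ?')->execute(['-']);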

@nticaric
Contributor

Are you sure fuzziness is turned off?

@bretvanhorn
Author

Are you sure fuzziness is turned off?

Yep, confirmed. I also realized my regex allows stray dashes (a "-" surrounded by spaces) to be indexed as standalone keywords. I am slow with regex, so I am trying to remedy that now.

@bretvanhorn
Author

bretvanhorn commented Jun 26, 2023

Ok @nticaric here is the regex I am currently using:

protected static $pattern = '/[^\p{L}\p{N}\.\+-](?!\s-\s)+/u';

I am still seeing the behavior where 70-200mm does not return relevant results (items with "200" and "70" show up, but nothing with "70-200"), while sx-70 does return relevant results. Any thoughts or guidance are welcome. Again, I suck at regex, so perhaps there is a better way to say what I want (I've put a rough sketch of one attempt after the list below):

Do not treat the following characters as token separators (everything else should split the text):

  • a-z
  • A-Z
  • 0-9
  • plus sign (+)
  • hyphen (-)
  • period (.)
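
One approach I am considering (just a sketch: keep the original permissive split pattern and filter out punctuation-only tokens afterwards, instead of encoding that rule in the regex):

    // Sketch: split permissively, then drop tokens with no letters or digits
    public function tokenize($text, $stopwords = []) {
        $tokens = preg_split($this->getPattern(), strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

        // A bare "-" or "+" never reaches the index this way
        return array_values(array_filter($tokens, function ($token) {
            return preg_match('/[\p{L}\p{N}]/u', $token) === 1;
        }));
    }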

Again, indexing seems to allow phrases like 70-200mm into the wordlist table, but searching for them does not yield the expected results.

[Screenshot: wordlist table containing 70-200mm]

Thanks again for your help.
