Option to ignore accents (diacritics) #9

niksy · 2020-09-09T08:22:44Z

I’m using quick-score with a language that uses accents/diacritics (Croatian). Sometimes I search with diacritics and sometimes without, so it would be nice to normalize the string that is used to search items.

Currently, I’m using node-diacritics to replace diacritics with standard ASCII characters, both in the search query and in the items, but this returns results where the diacritics are already removed instead of the original items.

Maybe add an option to transform the query and item strings?

import { QuickScore } from 'quick-score';
import diacritics from 'diacritics';
import traverse from 'traverse';

traverse(json).forEach(function(value) {
	if (this.notRoot) {
		if (typeof value === 'string') {
			this.update(diacritics.remove(value));
		}
	}
});

// …

search.addEventListener('input', (e) => {
	const result = qs.search(diacritics.remove(el.value));
});

fwextensions · 2020-09-11T02:30:08Z

Yes, handling diacritics is on my todo list. I'm no expert, but based on some investigation, it seems like String.normalize() may be a modern replacement for the diacritics library, which hasn't been updated in years and has some bugs, based on this issue.

In the meantime, you could add strings without diacritics and then search on those values instead of the original ones, but show the original ones when rendering the results list. I'm not sure how your data is structured and am not familiar with the traverse package, but maybe something like this:

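// use a regular function, since traverse passes its context as `this`
traverse(json).forEach(function(value) {
    if (this.notRoot && typeof value === "string") {
        // not sure if this will actually work...
        // this.parent is a traverse context, so write to its underlying node
        this.parent.node[this.key + "Scrubbed"] = diacritics.remove(value);
    }
});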

// tell QuickScore to score the "scrubbed" strings, instead of the original ones
const qs = new QuickScore(json, ["titleScrubbed", "urlScrubbed", /* etc. */]);

search.addEventListener('input', (e) => {
    const result = qs.search(diacritics.remove(el.value));
    // each result wraps the original object in its `item` property
    console.log(result.map(({item}) => `${item.title}: ${item.url}`).join("\n"));
});

And then you could show the original title or url keys when rendering the list.

niksy · 2020-09-11T06:48:06Z

I’ve just tried String.prototype.normalize on a basic Croatian string: using only that isn’t enough, you need to have some sort of mechanism in place to replace input characters with simple ones. That’s what node-diacritics does. Based on the PR you linked, normalization is used to split accented characters into parts, not to replace them.

The scrubbed-fields trick is a nice one, I will try that! Do you have an ETA for first-class diacritics support?

fwextensions · 2020-09-12T02:33:28Z

Can you give me some examples of strings where doing normalize() isn't enough? What they did in that PR was:

string.normalize('NFKD').replace(/[\u0300-\u036F]/g, '');

I believe that should break the characters into an ASCII character and a diacritic character, and then remove the diacritics.

> you need to have some sort of mechanism in place to replace input characters with simple ones.

Shouldn't it be sufficient to apply the same diacritic removal approach to the query string before searching, as you do in your code example? Maybe I'm misunderstanding something.

There are some additional approaches I've looked at:
https://stackoverflow.com/questions/990904/remove-accents-diacritics-in-a-string-in-javascript
https://www.npmjs.com/package/latinize
https://alistapart.com/article/accent-folding-for-auto-complete/

It shouldn't be hard to add the diacritic filtering, once I've decided what the right approach is. That latinize package is basically doing what was suggested in one of the SO answers. It's not ES6-friendly, though.

I'm also not sure if I want to include it as a dependency, for those who don't need it. latinize is 8 times the size of the QuickScore library, due to the large character map. Maybe I could add an option to pass in a pre-process function, and if a user supplied it, the query would be matched against the pre-processed strings, but the scoreKey in the results would be for the original unprocessed string. The matches array might not be right, though, depending on what the pre-processor does to the strings. Maybe that could be packaged up as a helper library, that you'd only import if needed. I'll have to think about it a bit.
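To make the pre-process idea concrete, here is a minimal sketch. It is not QuickScore's implementation: `createSearcher`, `stripAccents`, and `matchIndex` are hypothetical names, and a plain substring test stands in for the real scoring so the example stays self-contained.

```javascript
// Hypothetical helper: match the query against pre-processed strings, but
// return the original, unprocessed strings to the caller.
function createSearcher(items, preprocess = (s) => s.toLowerCase()) {
    // Pre-process each item once and keep the original next to it.
    const index = items.map((original) => ({
        original,
        processed: preprocess(original),
    }));

    return function search(query) {
        // The query goes through the same pre-processor as the items.
        const processedQuery = preprocess(query);

        return index
            .filter(({ processed }) => processed.includes(processedQuery))
            .map(({ original, processed }) => ({
                item: original,
                // Offset into the *processed* string, which may not line up
                // with the original if the pre-processor changes its length.
                matchIndex: processed.indexOf(processedQuery),
            }));
    };
}

// Assumed pre-processor for illustration: split accents off, then drop them.
const stripAccents = (s) =>
    s.normalize("NFKD").replace(/[\u0300-\u036F]/g, "").toLowerCase();

const search = createSearcher(["Čakovec", "Šibenik", "Zagreb"], stripAccents);
const results = search("sibenik");

console.log(results); // [ { item: 'Šibenik', matchIndex: 0 } ]
```

The `matchIndex` comment illustrates the concern about the matches array: any pre-processor that maps one character to several (or the reverse) shifts the offsets relative to the original string.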

niksy · 2020-09-14T07:48:19Z

> Can you give me some examples of strings where doing normalize() isn't enough? What they did in that PR was:

'čšđćž'.normalize('NFKD').replace(/[\u0300-\u036F]/g, '')

The third character (đ) is left unchanged, but that can be handled with custom character replacement.
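A short sketch of what that combination could look like: đ has no Unicode decomposition, so normalize() has nothing to split off, and an explicit replacement is needed. The `customReplacements` map and the `removeDiacritics` name are assumptions for illustration; the map is nowhere near complete, and mapping đ to "d" (rather than "dj", for example) is a choice, not a rule.

```javascript
// Characters without a Unicode decomposition need an explicit mapping.
// Illustrative assumption only: a real map would cover many more characters.
const customReplacements = { "đ": "d", "Đ": "D" };

function removeDiacritics(string) {
    return string
        // split accented characters into base character + combining mark
        .normalize("NFKD")
        // drop the combining marks
        .replace(/[\u0300-\u036F]/g, "")
        // handle the characters that normalization leaves untouched
        .replace(/[đĐ]/g, (char) => customReplacements[char]);
}

const normalizedOnly = "čšđćž".normalize("NFKD").replace(/[\u0300-\u036F]/g, "");
const fullyStripped = removeDiacritics("čšđćž");

console.log(normalizedOnly); // csđcz
console.log(fullyStripped);  // csdcz
```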

> I'm also not sure if I want to include it as a dependency, for those who don't need it.

Yeah, this should probably be optional, and the implementation shouldn’t be diacritics-specific. Maybe something along the lines of a function that takes one argument (the original string) and returns a processed string (which can be anything; in this case, the string with diacritics removed).

fwextensions · 2020-09-20T03:00:29Z

Thanks for the example. I see what you mean.

I've pushed a branch that includes the preprocessor function option. You could use that with the latinize package to create a simple function that would remove all the diacritics. Something like:

import latinize from "latinize";

function preprocessString(
	string)
{
	return latinize(string.toLocaleLowerCase());
}

const qs = new QuickScore(items, { preprocessString });

I'll throw together an example repo that does this.

fwextensions · 2020-09-21T00:16:19Z

I created this repo, which uses the version of QuickScore from the feature/ignore-diacritics branch. The index.html file shows an example of how the preprocessString option could be used to ignore diacritics.

niksy · 2020-11-18T11:34:16Z

@fwextensions sorry it took so long for me to check this!

I’ve tried it and this works! I think this could be a great addition to the package.

fwextensions · 2020-11-23T01:07:57Z

No worries, thanks for trying it out!

This does seem like the simplest approach. I'm planning on adding this feature to the library. Just trying to decide between preprocessString() and transformString() as the option name (maybe just processString()?).

niksy · 2020-11-23T06:44:34Z

processString sounds good!

fwextensions · 2021-01-02T21:38:10Z

Sorry for the delay! This functionality is now in the latest package on npm.

fwextensions added the "enhancement" (New feature or request) label on Sep 11, 2020

fwextensions closed this as completed in 4d84ced Jan 2, 2021
