Option to ignore accents (diacritics) #9

Closed
niksy opened this issue Sep 9, 2020 · 10 comments
Labels
enhancement New feature or request

Comments

@niksy

niksy commented Sep 9, 2020

I’m using quick-score with a language that uses accents/diacritics (Croatian). Sometimes I search with diacritics, sometimes without, so it would be nice to normalize the strings that are used for searching.

Currently, I’m using node-diacritics to replace diacritics with plain ASCII characters, both in the search query and in the items, but this returns results with the diacritics already removed instead of the original items.

Maybe add an option to transform the query and item strings?

import { QuickScore } from 'quick-score';
import diacritics from 'diacritics';
import traverse from 'traverse';

traverse(json).forEach(function(value) {
	if (this.notRoot) {
		if (typeof value === 'string') {
			this.update(diacritics.remove(value));
		}
	}
});

// …

search.addEventListener('input', (e) => {
	// e.target is the input element that fired the event
	const result = qs.search(diacritics.remove(e.target.value));
});
@fwextensions
Owner

fwextensions commented Sep 11, 2020

Yes, handling diacritics is on my todo list. I'm no expert, but based on some investigation, it seems like String.prototype.normalize() may be a modern replacement for the diacritics library, which hasn't been updated in years and has some bugs, according to this issue.

In the meantime, you could add strings without diacritics and then search on those values instead of the original ones, but show the original ones when rendering the results list. I'm not sure how your data is structured and am not familiar with the traverse package, but maybe something like this:

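traverse(json).forEach(function(value) {
    // traverse passes its context as `this`, so this has to be a regular function, not an arrow function
    if (this.notRoot && typeof value === "string") {
        // not sure if this will actually work...
        // this.parent is the parent's context; its node property is the parent object itself
        this.parent.node[this.key + "Scrubbed"] = diacritics.remove(value);
    }
});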

// tell QuickScore to score the "scrubbed" strings, instead of the original ones
const qs = new QuickScore(json, ["titleScrubbed", "urlScrubbed", /* etc. */]);

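search.addEventListener('input', (e) => {
    // e.target is the input element that fired the event
    const result = qs.search(diacritics.remove(e.target.value));
    // each result wraps the original object in its item property
    console.log(result.map(({item: {title, url}}) => `${title}: ${url}`).join("\n"));
});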

And then you could show the original title or url keys when rendering the list.
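
For data that is a flat array of objects, the same idea can be sketched without traverse. This is only an illustration under stated assumptions: `stripDiacritics` and `addScrubbedKeys` are hypothetical helpers, the sample items are made up, and a plain `includes()` filter stands in for the QuickScore search.

```javascript
// Hypothetical helper: decompose accented characters, then drop the combining marks.
// Letters without a decomposition (such as "đ") pass through unchanged.
function stripDiacritics(string) {
	return string.normalize("NFKD").replace(/[\u0300-\u036F]/g, "");
}

// Hypothetical helper: copy each item and add a "<key>Scrubbed" twin for every listed key.
function addScrubbedKeys(items, keys) {
	return items.map((item) => {
		const copy = { ...item };
		for (const key of keys) {
			if (typeof item[key] === "string") {
				copy[key + "Scrubbed"] = stripDiacritics(item[key]).toLowerCase();
			}
		}
		return copy;
	});
}

// Made-up sample data.
const items = addScrubbedKeys(
	[
		{ title: "Šećer", url: "https://example.com/secer" },
		{ title: "Kava", url: "https://example.com/kava" }
	],
	["title"]
);

// Stand-in for the real search: match on the scrubbed key, render the original one.
const query = stripDiacritics("sec").toLowerCase();
const titles = items
	.filter((item) => item.titleScrubbed.includes(query))
	.map((item) => item.title);

console.log(titles); // ["Šećer"]
```

The original `title` is never modified, so the rendered list keeps its diacritics while matching happens on the scrubbed copy.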

@fwextensions fwextensions added the enhancement New feature or request label Sep 11, 2020
@niksy
Author

niksy commented Sep 11, 2020

I’ve just tried String.prototype.normalize on a basic Croatian string: it’s not enough on its own. You need some mechanism to replace the input characters with simple ones, and that’s what node-diacritics does. Based on the PR you linked, normalization is used to split accented characters, not to replace them.

The scrubbed fields trick is a nice one, I will try that! Do you have an ETA for first-class diacritics support?

@fwextensions
Owner

Can you give me some examples of strings where doing normalize() isn't enough? What they did in that PR was:

string.normalize('NFKD').replace(/[\u0300-\u036F]/g, '');

I believe that should break the characters into an ASCII character and a diacritic character, and then remove the diacritics.
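
A small sketch can make the decomposition step visible by printing code points before and after normalizing. `codePoints` is a hypothetical helper written for this illustration; `normalize()` and the regex are the ones from the snippet above.

```javascript
// Hypothetical helper: list the code points of a string as 4-digit hex values.
function codePoints(string) {
	return Array.from(string, (char) => char.codePointAt(0).toString(16).padStart(4, "0"));
}

const composed = "\u010D"; // "č" as a single precomposed code point
const decomposed = composed.normalize("NFKD");

console.log(codePoints(composed));   // ["010d"]
console.log(codePoints(decomposed)); // ["0063", "030c"], i.e. "c" plus a combining caron

// The regex covers the Combining Diacritical Marks block, so only the base letter is left.
const stripped = decomposed.replace(/[\u0300-\u036F]/g, "");
console.log(stripped); // "c"
```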

> you need to have some sort of mechanism in place to replace input characters to simple ones.

Shouldn't it be sufficient to apply the same diacritic removal approach to the query string before searching, as you do in your code example? Maybe I'm misunderstanding something.

There are some additional approaches I've looked at:
https://stackoverflow.com/questions/990904/remove-accents-diacritics-in-a-string-in-javascript
https://www.npmjs.com/package/latinize
https://alistapart.com/article/accent-folding-for-auto-complete/

It shouldn't be hard to add the diacritic filtering, once I've decided what the right approach is. That latinize package is basically doing what was suggested in one of the SO answers. It's not ES6-friendly, though.

I'm also not sure if I want to include it as a dependency, for those who don't need it. latinize is 8 times the size of the QuickScore library, due to the large character map. Maybe I could add an option to pass in a pre-process function, and if a user supplied it, the query would be matched against the pre-processed strings, but the scoreKey in the results would be for the original unprocessed string. The matches array might not be right, though, depending on what the pre-processor does to the strings. Maybe that could be packaged up as a helper library, that you'd only import if needed. I'll have to think about it a bit.
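
The pre-process idea, and the caveat about the matches array, can be sketched in isolation. This is not QuickScore's API: `createSearcher` and `fold` are hypothetical names, and a plain `indexOf()` stands in for the real scoring.

```javascript
// Hypothetical stand-in for a scorer that accepts a pre-process function.
function createSearcher(strings, preprocess = (s) => s.toLowerCase()) {
	// Pre-process every string once, but keep the original alongside it.
	const entries = strings.map((original) => ({ original, processed: preprocess(original) }));

	return function search(query) {
		const processedQuery = preprocess(query);
		const results = [];
		for (const { original, processed } of entries) {
			const start = processed.indexOf(processedQuery);
			if (start > -1) {
				// The match range indexes into the processed string, not the original.
				results.push({ item: original, match: [start, start + processedQuery.length] });
			}
		}
		return results;
	};
}

// Example pre-processor: strip combining marks and lowercase.
const fold = (s) => s.normalize("NFKD").replace(/[\u0300-\u036F]/g, "").toLowerCase();

// "\uFB01" is the "fi" ligature, which NFKD expands into the two letters "f" and "i".
const search = createSearcher(["Čaj", "\uFB01lter kava"], fold);

console.log(search("caj"));    // [{ item: "Čaj", match: [0, 3] }]
console.log(search("filter")); // match is [0, 6], but the original word is only 5 characters long
```

The second search shows why the match ranges may not line up with the original string: any pre-processor that changes the string's length shifts the indices.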

@niksy
Author

niksy commented Sep 14, 2020

> Can you give me some examples of strings where doing normalize() isn't enough? What they did in that PR was:

'čšđćž'.normalize('NFKD').replace(/[\u0300-\u036F]/g, '')

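The third character (đ) is left unchanged, because it has no decomposition into a base letter plus a combining mark. That can be handled with a custom character replacement, though.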
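
A minimal sketch of such a custom replacement, layered on top of the normalize() approach. The `EXTRA_REPLACEMENTS` map and the `removeDiacritics` name are assumptions for illustration; the map only covers "đ" and "Đ", and replacing them with "d" and "D" is one possible convention.

```javascript
// Letters that Unicode normalization leaves alone because they have no decomposition.
// Illustrative only: this covers just the Croatian "đ"; extend it as needed.
const EXTRA_REPLACEMENTS = { "đ": "d", "Đ": "D" };

function removeDiacritics(string) {
	return string
		.normalize("NFKD")
		// drop the combining marks produced by the decomposition
		.replace(/[\u0300-\u036F]/g, "")
		// then handle the letters that normalization could not split
		.replace(/[đĐ]/g, (char) => EXTRA_REPLACEMENTS[char]);
}

console.log("čšđćž".normalize("NFKD").replace(/[\u0300-\u036F]/g, "")); // "csđcz"
console.log(removeDiacritics("čšđćž"));  // "csdcz"
console.log(removeDiacritics("Đakovo")); // "Dakovo"
```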

> I'm also not sure if I want to include it as a dependency, for those who don't need it.

Yeah, this should probably be optional, and the implementation shouldn’t be diacritics-specific. Maybe something along the lines of a function that takes one argument, the original string, and returns the processed string (which can be anything; in this case, the string with diacritics removed).

@fwextensions
Owner

Thanks for the example. I see what you mean.

I've pushed a branch that includes the preprocessor function option. You could use that with the latinize package to create a simple function that would remove all the diacritics. Something like:

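import { QuickScore } from "quick-score";
import latinize from "latinize";

// lowercase the string, then replace accented characters with their plain equivalents
function preprocessString(string) {
	return latinize(string.toLocaleLowerCase());
}

const qs = new QuickScore(items, { preprocessString });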

I'll throw together an example repo that does this.

@fwextensions
Owner

I created this repo, which uses the version of QuickScore from the feature/ignore-diacritics branch. The index.html file shows an example of how the preprocessString option could be used to ignore diacritics.

@niksy
Author

niksy commented Nov 18, 2020

@fwextensions sorry it took so long for me to check this!

I’ve tried it and it works! I think this could be a great addition to the package.

@fwextensions
Owner

No worries, thanks for trying it out!

This does seem like the simplest approach. I'm planning on adding this feature to the library. Just trying to decide between preprocessString() and transformString() as the option name (maybe just processString()?).

@niksy
Author

niksy commented Nov 23, 2020

processString sounds good!

@fwextensions
Owner

Sorry for the delay! This functionality is now in the latest package on npm.
