Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correctly handle empty strings in charmaps #174

Merged
merged 1 commit into from Mar 26, 2023
Merged

correctly handle empty strings in charmaps #174

merged 1 commit into from Mar 26, 2023

Conversation

iliazeus
Copy link
Contributor

@iliazeus iliazeus commented Mar 21, 2023

Resolves #172

Resolves #95

@Trott
Copy link
Collaborator

Trott commented Mar 21, 2023

This introduces a new way to remove characters, but there is already the remove option.

slugify('cat', {remove: /a/ig}) // 'ct'

Not necessarily saying we shouldn't do this, but we might want to pause before introducing multiple ways to do the same thing. If this is a more ergonomic way, consider deprecating remove? If it's a less ergonomic way, don't implement it?

@iliazeus
Copy link
Contributor Author

iliazeus commented Mar 22, 2023

@Trott they're not the same semantically.

The "mapping into empty string" is related to the transliteration rules. It is a way to specify in the locale and/or charmap that we don't want a specific letter to be translated into anything. This is already in the charmap for ь and Ь, for example; this should also be in the Russian locale for ъ and Ъ, and there are probably more cases.

The remove argument is a way for the user to specify that they don't want specific characters in the final result. It does not depend on locale and charmap, and is usually used for things like punctuation. The user will probably not be aware of all the transliteration rules, and they won't put things like ь in there when specifying a custom regex.

@Trott
Copy link
Collaborator

Trott commented Mar 22, 2023

@Trott they're not the same semantically.

Thanks for the explanation. That makes sense and I understand now.

Comment on lines +284 to +286
delete require.cache[require.resolve('../')]
slugify = require('../')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant to this PR, but slug has a .reset() method to make this more ergonomic. Maybe slugify could do something similar.

@Trott
Copy link
Collaborator

Trott commented Mar 22, 2023

This is only useful for the twenty six letters of the Latin alphabet, right? For everything else, it already works as expected? Or am I wrong about that? (EDIT: Maybe it's useful for certain common symbols too?)

@iliazeus
Copy link
Contributor Author

@Trott my own usecase was actually similar to the one described in #172: the Cyrillic characters ь and Ь are (correctly) mapped into empty strings in charmap.json. It should also be done for ъ and Ъ when the Russian locale is added (right now they use Bulgarian transliterations in charmap.json). Thing is, those are proper letters, but since (in Russian, at least) they don't have their own proper sound, they are usually just omitted when transliterating (but in Bulgarian, Ъ isn't!). This feature is useful for these kinds of letters. I don't know enough non-latin and non-cyrillic scripts to offer other examples, though :)

The test uses latin characters for the sake of simplicity.

@Trott
Copy link
Collaborator

Trott commented Mar 23, 2023

@Trott my own usecase was actually similar to the one described in #172: the Cyrillic characters ь and Ь are (correctly) mapped into empty strings in charmap.json. It should also be done for ъ and Ъ when the Russian locale is added (right now they use Bulgarian transliterations in charmap.json). Thing is, those are proper letters, but since (in Russian, at least) they don't have their own proper sound, they are usually just omitted when transliterating (but in Bulgarian, Ъ isn't!). This feature is useful for these kinds of letters. I don't know enough non-latin and non-cyrillic scripts to offer other examples, though :)

The test uses latin characters for the sake of simplicity.

I'm sorry if I'm being foolish here and missing something obvious, but doesn't that already work without the change in this pull request?

const slugify = require('slugify');

const str = 'ъaъ';

console.log(slugify(str)); // 'uau'

slugify.extend({ 'ъ': ''})
console.log(slugify(str)) // 'a'

Maybe all that needs to happen is to add a Russian locale to config/locales.json?

@iliazeus
Copy link
Contributor Author

iliazeus commented Mar 23, 2023

@Trott sorry for being so unclear :)

The tricky part is: in your case, ъ is removed not by the extended charmap, but by the default remove regexp, which matches [^\w] (among others). Here is a quick test with a dummy remove:

> var slugify = require('slugify')
undefined
> slugify('ъяъ', { remove: /[]/ })
'uyau'
> slugify.extend({ 'ъ': '' })
undefined
> slugify('ъяъ', { remove: /[]/ })
'ъyaъ'

The order of steps for the current behavior is:

  • ъ is mapped by the charmap to ''
  • or-expressions on line 36 turn it into '' || 'ъ' === 'ъ'
  • it is then tested against remove

If someone that is not aware of language-specific rules specifies a custom remove, then mapping to empty string will stop working - it only did because of the default remove. This PR fixes that.

@iliazeus
Copy link
Contributor Author

@Trott we could instead just document that the remove regexp must contain the [^\w] class, but this would still be counter-intuitive, since it looks like it's mostly meant for punctuation and other non-letter, non-locale-dependent characters. The documented example does actually not match [^\w] :)

@iliazeus
Copy link
Contributor Author

I've updated the test to be a more relevant example of bugged behavior.

@Trott
Copy link
Collaborator

Trott commented Mar 24, 2023

I think I see now. FWIW, the slug module already works this way.

const slug = require("slug");
const slugify = require("slugify");

const str = "ъяъ";

console.log(slug(str)); // 'uyau'
console.log(slugify(str)); // 'uyau'

slug.extend({ ъ: "" });
slugify.extend({ ъ: "" });

console.log(slug(str)); // 'ya'
console.log(slugify(str)); // 'ya'

console.log(slug("ъяъ", { remove: /[]/g })); // 'ya'
console.log(slugify("ъяъ", { remove: /[]/g })); // 'ъyaъ'

@simov simov merged commit bbc56e2 into simov:master Mar 26, 2023
6 checks passed
@simov
Copy link
Owner

simov commented Mar 26, 2023

Published in v1.6.6

@Trott I also added this 3f0b3f5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cyrillic character in url causing encoding issues replacement and remove (question & suggestion)
3 participants