Option for replacement char #2

Open
makew0rld opened this issue Mar 26, 2021 · 5 comments

Comments

@makew0rld

Instead of removing characters that can't be translated, it'd be nice to have an option to replace them with a character.

For some languages (like Python) this could be added as a new argument with a default value, like replace="". For others (like Go) this would have to be a new function.

@hunterwb
Member

A goal is to support all characters in Unicode, which would make replacements unnecessary. It is already practically the case that users will never encounter unsupported characters. Currently the only missing characters are in the following blocks/scripts: CJK, CUNEIFORM, BAMUM_SUPPLEMENT, TANGUT, KHITAN_SMALL_SCRIPT, DUPLOYAN, BYZANTINE_MUSICAL_SYMBOLS, MUSICAL_SYMBOLS, SUTTON_SIGNWRITING. I am planning on adding support for many of these. I don't want to confuse users into thinking they have to worry about unsupported characters. Other cases involving unassigned or Special code points I think are out of scope and should be handled elsewhere if necessary.

@makew0rld
Author

Ah, I see. I still think it could be useful, as you never know what strange characters might appear in a string, and it's better to replace than to strip IMO. But I didn't realize so much of Unicode was covered, that's great.

@vovikdrg

vovikdrg commented Jun 8, 2021

This would also be useful for different languages. For instance, my name Володимир in Ukrainian should be Volodymyr (not Volodimir), but the same name Владимир in Russian should be Vladimir.

Even the test is wrong: check("Володимир Горбулін", "Volodimir Gorbulin"); This is a Ukrainian name, since Russian doesn't have the letter і. So the right transliteration should be Volodymyr Gorbulin. (https://en.wikipedia.org/wiki/Volodymyr_Horbulin)

PS. I am happy to contribute

@hunterwb
Member

hunterwb commented Aug 8, 2021

If you would like custom replacements for specific characters I would suggest doing them yourself before calling anyascii.

public static String transliterate(String s) {
    // Apply the custom replacements first, then let anyascii handle the rest
    s = s.replace('Г', 'H').replace('и', 'y'); // etc
    return AnyAscii.transliterate(s);
}

However, Ukrainian, like most languages, requires context such as look-ahead for correct romanization, and can't be fully supported by the simple model used by anyascii (context-free 1-to-1 replacements). You should use a separate language-specific method to romanize the Ukrainian Cyrillic and then call anyascii afterwards if you still need to.

AnyAscii.transliterate(romanizeUkrainian(s))
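
As a rough illustration of why context matters, here is a minimal sketch of what a romanizeUkrainian step could look like. It is not part of anyascii. The class name is hypothetical, and the small table reflects my understanding of the 2010 Ukrainian national romanization, where a handful of letters are written differently at the start of a word. It is deliberately incomplete (no handling of "зг", apostrophes, or the soft sign), and any letter it does not know is left untouched so that a later anyascii call can deal with it.

```java
import java.util.Map;

public class UkrainianRomanizerSketch {
    // Letters whose romanization depends on position: {word-initial, elsewhere}.
    private static final Map<Character, String[]> POSITIONAL = Map.of(
        'є', new String[] {"ye", "ie"},
        'ї', new String[] {"yi", "i"},
        'й', new String[] {"y", "i"},
        'ю', new String[] {"yu", "iu"},
        'я', new String[] {"ya", "ia"});

    // Context-free letters where Ukrainian differs from a generic Cyrillic mapping.
    private static final Map<Character, String> SIMPLE = Map.of(
        'г', "h",
        'и', "y");

    public static String romanizeUkrainian(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean wordStart = true;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char lower = Character.toLowerCase(c);
            String r;
            if (POSITIONAL.containsKey(lower)) {
                // Pick the variant based on whether the previous char was a letter.
                r = POSITIONAL.get(lower)[wordStart ? 0 : 1];
            } else {
                r = SIMPLE.get(lower); // null when the letter is not in the table
            }
            if (r == null) {
                out.append(c); // unknown here: leave it for a later pass
            } else if (Character.isUpperCase(c)) {
                // Preserve capitalization of the first output letter.
                out.append(Character.toUpperCase(r.charAt(0))).append(r, 1, r.length());
            } else {
                out.append(r);
            }
            wordStart = !Character.isLetter(c);
        }
        return out.toString();
    }
}
```

The output is intentionally a mix of Latin and Cyrillic: only the letters in the two tables are rewritten, and everything else is passed through for the generic transliteration that follows.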

I don't want to add custom replacement logic to anyascii because it can easily be done beforehand, and anyone who wants language-specific replacements is probably better off using a language-specific library.

The test cases don't check whether the result is perfect, only that it stays consistent with the examples given in the readme. The readme examples are there to highlight the limitations of anyascii. There's a 4th column in the table that compares the output to the correct romanization; you may need to scroll to see it.

@stephenwilcoxon

To get back to the original issue of characters that can't be translated: would it be possible to have an option to keep characters that can't be or aren't translated, rather than drop them? In my use case, I'm still in UTF but want transliteration. However, if something can't be or isn't transliterated, I want the original character kept (not simply removed). The problem with simple removal is that there is no way to know a character wasn't transliterated (without checking the code tables).
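
For what it's worth, both behaviours discussed in this thread (replace with a fixed character, or keep the original) could probably be layered on top of any per-code-point transliteration function without changing the library. The sketch below is only an illustration under assumptions: the class, `replaceWith`, `keepOriginal`, and `toyTransliterate` are hypothetical names, and `toyTransliterate` is a stand-in rather than anyascii itself. It also assumes that an empty result means "not supported", which cannot be told apart from a character that is intentionally transliterated to nothing.

```java
import java.util.function.IntFunction;

public class FallbackTransliterator {
    // Transliterates code point by code point; when the per-code-point function
    // yields null or "", the fallback decides what to emit instead.
    public static String transliterate(String s,
                                       IntFunction<String> perCodePoint,
                                       IntFunction<String> fallback) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            String r = perCodePoint.apply(cp);
            out.append(r == null || r.isEmpty() ? fallback.apply(cp) : r);
        });
        return out.toString();
    }

    // Fallback that always emits the same replacement string.
    public static IntFunction<String> replaceWith(String replacement) {
        return cp -> replacement;
    }

    // Fallback that keeps the original character unchanged.
    public static IntFunction<String> keepOriginal() {
        return cp -> new String(Character.toChars(cp));
    }

    // Stand-in for a real transliterator: ASCII maps to itself,
    // U+00E9 (e with acute) maps to "e", everything else is "unsupported".
    public static String toyTransliterate(int cp) {
        if (cp < 128) return String.valueOf((char) cp);
        if (cp == 0xE9) return "e";
        return "";
    }

    public static void main(String[] args) {
        String input = "caf\u00e9 \u4e2d";
        System.out.println(transliterate(input, FallbackTransliterator::toyTransliterate, replaceWith("?")));
        System.out.println(transliterate(input, FallbackTransliterator::toyTransliterate, keepOriginal()));
    }
}
```

With a real library, the stand-in would be swapped for whatever per-code-point call that library exposes. The replacement-character variant makes dropped characters visible, while the keep-original variant suits cases where the output does not have to be pure ASCII.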
