Incorrect String Comparison Results with ICU #101422

Open
danstur opened this issue Apr 23, 2024 · 12 comments

Comments

@danstur

danstur commented Apr 23, 2024

Description

With the switch to the ICU library, string comparisons do not work as expected. The behavior also differs from what ICU itself should generally return.

In a case insensitive comparison SS and ß do not compare equal. According to the current Unicode case folding rules they should be equal as I understand it:
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S (CaseFolding.txt)

Also, checking with the ICU Unicode String Comparison, it says that the result should be equal.

It seems to me that somehow the specific ICU library used with .NET 6/8 has a bug or that it is used incorrectly.

See also https://stackoverflow.com/questions/78371156/%c3%9f-ss-for-case-insensitive-comparison-with-icu and https://stackoverflow.com/questions/78364649/why-does-%c3%9f-equalsss-stringcomparison-currentcultureignorecase-differ-betw

Reproduction Steps

"ß".Equals("SS", StringComparison.CurrentCultureIgnoreCase); // returns false when using ICU library
// The same is true for Contains and IndexOf

Expected behavior

The above code should return true.

Actual behavior

The above code returns false when the ICU library is used. With the following setting in the project file,

<ItemGroup>
  <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
</ItemGroup>

the code returns true as expected.
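
When toggling this setting, it can help to confirm which globalization backend the process is actually running on. The sketch below follows the sort-version check described in the .NET globalization documentation; the names `GlobalizationProbe` and `IsIcuMode` are illustrative and not part of any API, and the check is best treated as a heuristic.

```csharp
using System;
using System.Globalization;

public static class GlobalizationProbe
{
    // Heuristic: under ICU, the first four bytes of SortId encode the same
    // value as FullVersion; under NLS they do not.
    public static bool IsIcuMode()
    {
        SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
        byte[] bytes = sortVersion.SortId.ToByteArray();
        int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];
        return version != 0 && version == sortVersion.FullVersion;
    }
}
```

Printing `GlobalizationProbe.IsIcuMode()` next to the result of the reproduction line makes it easier to tell whether a differing result comes from the backend or from something else, such as the current culture.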

Regression?

A regression compared to any version before .NET 5 on Windows.

Known Workarounds

Specify System.Globalization.UseNls with Value="true". Sadly this only works on Windows and does not help on other platforms.

Configuration

Dotnet SDK: 8.0.104
OS: Windows 11 22H2 (22621.3447)
Architecture: x64

Other information

No response

@dotnet-policy-service dotnet-policy-service bot added the untriaged label Apr 23, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented Apr 23, 2024

Duplicate of #20599 (comment)

@jkotas jkotas marked this as a duplicate of #20599 Apr 23, 2024
@tarekgh
Member

tarekgh commented Apr 23, 2024

Yes, this is a duplicate. @danstur you may try #20599 (comment) to get the desired behavior. Feel free to ask if you have any more questions. Thanks for your report.

@tarekgh tarekgh closed this as completed Apr 23, 2024
@dotnet-policy-service dotnet-policy-service bot removed the untriaged label Apr 23, 2024
@danstur
Author

danstur commented Apr 23, 2024

@tarekgh Can you explain the need for IgnoreNonSpace which says "Indicates that the string comparison must ignore nonspacing combining characters, such as diacritics" and that rather more awkward API?

According to the ICU library ß should be case insensitive equal to ss independent of language, so it's not that I need some collation for this to work.
And neither ß nor ss contain any "nonspacing combining characters" but are all single codepoints and ß doesn't have any decompositions listed. Looking at the "based on s" list (https://www.compart.com/en/unicode/U+0073) there's also no ß in that list.

Looking at the stackoverflow post, the commenters are all just as confused by the current behavior as I am.

If this is the expected behavior, it'd be great if you could point to some more exhaustive documentation that explains exactly how the normal String APIs behave. I honestly couldn't say what the expected behavior of Equals(..., StringComparison.CurrentCultureIgnoreCase) is in .NET 8.

@Another-Ralf

Another-Ralf commented Apr 23, 2024

What about documenting that behavior? Like here https://learn.microsoft.com/en-us/dotnet/csharp/how-to/compare-strings#linguistic-comparisons.

The documentation's examples explicitly discuss the "ß" vs. "SS" case, which implies looking at the comparison overloads that take StringComparison, but that seems a dead end, at least to me, for getting the expected standard behavior. None of the StringComparison options translates to or includes the needed CompareOptions.IgnoreNonSpace, right?

@tarekgh tarekgh reopened this Apr 23, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged label Apr 23, 2024
@tarekgh
Member

tarekgh commented Apr 23, 2024

To give some more info on why you are seeing this behavior: ICU collation works using what is called collation strength. Strength can be Primary, Secondary, Tertiary, or Quaternary. We try to map the .NET comparison options to one of these strengths as closely as we can, which works fine except in special cases like this one. Unfortunately, ICU makes ß equal to ss only when the ICU strength is Primary. We cannot switch to that strength by default in .NET because it would break many other things. I have also seen users complain about ICU itself because of this case. The workaround in .NET is to add the comparison option IgnoreNonSpace, which causes ICU to use the Primary strength. You can play with that in the Collation Demo.
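
The workaround described above can be sketched roughly as follows. This is a minimal illustration, not an official recommendation: the class and method names (`SharpSWorkaround`, `LooseEquals`, `LooseIndexOf`) are made up for this example, and only `CompareInfo.Compare`, `CompareInfo.IndexOf`, and the `CompareOptions` flags are real .NET APIs.

```csharp
using System;
using System.Globalization;

public static class SharpSWorkaround
{
    // Culture-aware equality that ignores case and nonspacing marks.
    // Per the explanation above, adding IgnoreNonSpace is what moves
    // ICU to Primary strength.
    public static bool LooseEquals(string a, string b, CultureInfo culture)
    {
        const CompareOptions options =
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;
        return culture.CompareInfo.Compare(a, b, options) == 0;
    }

    // Same options applied to substring search.
    public static int LooseIndexOf(string source, string value, CultureInfo culture)
    {
        const CompareOptions options =
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;
        return culture.CompareInfo.IndexOf(source, value, options);
    }
}
```

According to this thread, `SharpSWorkaround.LooseEquals("ß", "SS", CultureInfo.CurrentCulture)` is the call that should report equality under ICU. The trade-off is that IgnoreNonSpace also makes strings differing only in diacritics compare equal, so it is broader than a pure case-insensitive comparison.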

I reactivated this issue to track collation customization for this special case of ss and ß.

@tarekgh tarekgh added this to the Future milestone Apr 23, 2024
@tarekgh tarekgh removed the untriaged label Apr 23, 2024
@danstur
Author

danstur commented Apr 23, 2024

@tarekgh Thanks for the explanation. Unicode is once again even more complicated than anticipated.

Is there any documentation on which collation strength the various comparison options use, and on how I can figure out at which collation strength two characters compare equal?
Or do I just throw them into the collation demo and check what happens that way? I guess for my use cases that's fine too.

Ah Unicode causing headaches as usual.

@tarekgh
Member

tarekgh commented Apr 23, 2024

We don't document the mapping between the .NET options and ICU collation strength, as this is more of an implementation detail. But if you are interested, you can see the mapping here:

UColAttributeValue strength = ucol_getStrength(pCollator);

Note, the default strength is always Tertiary.

@andjc

andjc commented Apr 28, 2024

As far as I can tell, it's working as expected. You are using a collation comparison, not a comparison of case-folded strings. The Unicode Collation Algorithm follows the relevant DIN standard for sorting ß.

The UCA maintains compatibility with the DIN standard for sorting German by having the German sharp-s (U+00DF (ß) LATIN SMALL LETTER SHARP S) sort as a secondary difference with "SS", instead of having ß and SS match at the secondary level.

If you are comparing strings using full case folding, they will match. If you are comparing using simple casefolding, they will not match. If you compare using a collator set to Primary strength they will match. If you use a different strength for the collator, they will not match.

Yes, under full casefolding, they should be treated the same, but they are intended to sort differently.
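
The difference between simple and full case folding can be illustrated in .NET terms. The sketch below assumes that `StringComparison.OrdinalIgnoreCase` behaves like a simple (one-to-one) case-insensitive comparison, and it uses a hypothetical helper, `FoldSharpS`, that applies only the sharp-s entries of full case folding. It is not a general full case folding implementation, and all names in it are invented for this example.

```csharp
using System;

public static class FoldingSketch
{
    // Hypothetical helper: handles only U+00DF and U+1E9E -> "ss".
    // Every other full case folding rule is NOT covered.
    public static string FoldSharpS(string s) =>
        s.Replace("\u00DF", "ss", StringComparison.Ordinal)
         .Replace("\u1E9E", "ss", StringComparison.Ordinal);

    // Simple case-insensitive comparison: no one-to-many mappings.
    public static bool SimpleEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    // Expands sharp s first, then compares case-insensitively.
    public static bool SharpSFoldedEquals(string a, string b) =>
        string.Equals(FoldSharpS(a), FoldSharpS(b), StringComparison.OrdinalIgnoreCase);
}
```

Here `FoldingSketch.SimpleEquals("ß", "SS")` is false, because the two strings differ in length and no one-to-one mapping can bridge that, while `FoldingSketch.SharpSFoldedEquals("ß", "SS")` is true. Neither call involves a collator, which is the distinction drawn above between folding-based matching and collation.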

@tarekgh
Member

tarekgh commented Apr 28, 2024

@andjc Indeed, we recognize that this behavior aligns with what Unicode collation defines. However, the issue arises from the fact that .NET Framework previously utilized Windows collation (NLS), where ss equated to ß. With the transition to ICU in .NET Core, some users have voiced concerns about this new behavior. The question at hand is whether we should adhere to the Unicode collation behavior or make a special case for this particular scenario.

@andjc

andjc commented Apr 28, 2024

I assume there are other differences as well.

As a rule I use multiple operating systems and multiple programming languages.

System based locale data and locale operations differ across implementations. I use ICU when I want consistent results across different platforms and different programming languages.

ICU by default uses the CLDR collation algorithm, referred to as the root collation, and tailors it as required per locale, with some locales having multiple collation tailorings.

Changing the ICU collation to match NLS breaks that benefit of ICU. It also raises the question of whether German collation should be system dependent, i.e. using NLS rules on Windows and platform-specific rules on other platforms, so that results diverge by platform.

What comes to mind is that ICU supports multiple tailorings, including for German, i.e. standard vs. phonebook style collation. This can be enabled by using a variant locale, using either POSIX or BCP47 identifiers. Given that alternative collation rules are already available, a logical approach would be to add another locale variant to kick in NLS-compatible collation.

That way you can retain ICU's collation and add a tailored collation that changes the collation weight of ß to match previous implementations.

Also, considering it was noted above that the collation strength is set to Tertiary, how do you handle Japanese? I was under the impression that Excel and other apps use a sort that would be equivalent to Quaternary strength.

@tarekgh
Member

tarekgh commented Apr 28, 2024

> What comes to mind is that ICU supports multiple tailorings, including for German, ie standard vs phonebook style collation. This can be enabled by using a variant locale, either using POSIX or BCP47 identifiers. Given that alternative collation rules are already available, a logical approach would be to add another locale variant to kick in NLS compatible collation.

Regrettably, the behavior of equating 'ss' to 'ß' functioned uniformly across all locales in NLS, which is what the users requesting this behavior expect. As previously noted in this issue, there is a workaround for users who wish to use it, namely the 'IgnoreNonSpace' comparison option.

In principle, we try to adhere closely to CLDR/Unicode behavior, which is why no action has been taken on this issue so far. However, users continue to express discontent about it, mostly users who have used .NET Framework for a while.

5 participants