Incorrect String Comparison Results with ICU #101422
Comments
Tagging subscribers to this area: @dotnet/area-system-globalization |
Duplicate of #20599 (comment) |
Yes, this is a duplicate. @danstur you may try #20599 (comment) to get the desired behavior. Feel free to ask any further questions you may have. Thanks for your report. |
@tarekgh Can you explain the need for According to the ICU library Looking at the stackoverflow post, the commenters are all just as confused by the current behavior as I am. If this is the expected behavior, it'd be great if you could point to some more exhaustive documentation that explains exactly how the normal String APIs behave. I honestly couldn't say what the expected behavior of |
What about documenting that behavior? For example here: https://learn.microsoft.com/en-us/dotnet/csharp/how-to/compare-strings#linguistic-comparisons. The documentation's examples talk explicitly about the "ß" vs. "SS" case, which implies looking at the comparison overloads that take StringComparison, but that seems a dead end, at least to me, for getting the expected standard behavior. None of the StringComparison options translates to or includes the needed CompareOptions.IgnoreNonSpace, right? |
To give some more info on why you are seeing this behavior: ICU collation works using what is called collation strength. Strength can be Primary, Secondary, Tertiary, or Quaternary. We try to map the .NET comparison options to one of these strengths as closely as we can, which works fine except in such special cases. Unfortunately, ICU make I reactivated this issue to track collation customization for this special case of |
@tarekgh Thanks for the explanation; Unicode is once again even more complicated than anticipated. Is there any documentation on which collation strength the various comparison options use, and how I can figure out at what collation strength two characters compare equal? Ah, Unicode causing headaches as usual. |
We don't document the mapping between the .NET options and ICU collation strength, as this is more of an implementation detail. But you can see the mapping here
Note, the default strength is always |
As far as I can tell, it's working as expected. You are using a collation comparison, not a comparison of case-folded strings. The Unicode Collation Algorithm follows the relevant DIN standard for sorting ß.
If you are comparing strings using full case folding, they will match. If you are comparing using simple casefolding, they will not match. If you compare using a collator set to Primary strength they will match. If you use a different strength for the collator, they will not match. Yes, under full casefolding, they should be treated the same, but they are intended to sort differently. |
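The distinction drawn above between case mapping and collation can be probed from .NET with a small sketch. It assumes the invariant culture and uses only `string.Equals`, `CompareInfo.Compare` and `CompareOptions`; the class and method names (`SharpSLevels`, `OrdinalIgnoreCase`, `Linguistic`) are hypothetical. The linguistic results depend on whether the process runs on ICU or NLS, so the sketch prints them rather than asserting them.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class SharpSLevels
{
    // Ordinal comparison with simple, per-character case mapping.
    public static bool OrdinalIgnoreCase(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    // Culture-aware (collation) comparison with a caller-chosen option set.
    public static bool Linguistic(string a, string b, CompareOptions options) =>
        CultureInfo.InvariantCulture.CompareInfo.Compare(a, b, options) == 0;

    public static void Main()
    {
        // Simple case mapping has no single-character uppercase for 'ß',
        // so the ordinal comparison cannot match "SS".
        Console.WriteLine("OrdinalIgnoreCase: " + OrdinalIgnoreCase("ß", "SS"));

        // These two results depend on the globalization backend (ICU or NLS).
        Console.WriteLine("IgnoreCase: " +
            Linguistic("ß", "SS", CompareOptions.IgnoreCase));
        Console.WriteLine("IgnoreCase | IgnoreNonSpace: " +
            Linguistic("ß", "SS", CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace));
    }
}
```

Running this once with ICU and once with NLS enabled should show where the two backends diverge for this pair of strings.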
@andjc Indeed, we recognize that this behavior aligns with what Unicode collation defines. However, the issue arises from the fact that .NET Framework previously utilized Windows collation (NLS), where |
I assume there are other differences as well. As a rule, I use multiple operating systems and multiple programming languages. System-based locale data and locale operations differ across implementations. I use ICU when I want consistent results across different platforms and different programming languages. ICU by default uses the CLDR collation algorithm, what is referred to as the root collation, and tailors that as required per locale, with some locales having multiple collation tailorings. Changing the ICU collation to match NLS breaks that benefit of ICU. It also raises the question of whether German collation should be system dependent, i.e. using NLS rules on Windows and platform-specific rules on other platforms, so that results diverge by platform. What comes to mind is that ICU supports multiple tailorings, including for German, i.e. standard vs. phonebook-style collation. These can be enabled by using a variant locale, with either POSIX or BCP 47 identifiers. Given that alternative collation rules are already available, a logical approach would be to add another locale variant to kick in NLS-compatible collation. That way you can retain ICU's collation and add a tailored collation that changes the collation weight of ß to match previous implementations. Also, since it was noted above that the collation strength is set to tertiary, how do you handle Japanese? I was under the impression that Excel and other apps used a sort that would be equivalent to a quaternary strength. |
Regrettably, the behavior of equating 'ss' to 'ß' functioned uniformly across all locales in NLS, which is what the users requesting this behavior expect. As previously noted in this issue, there is a workaround for users who wish to use it, namely the 'IgnoreNonSpace' comparison option. In principle, we try to adhere closely to CLDR/Unicode behavior, which is why no action has been taken on this issue so far. However, users continue to express discontent about it, mostly users who have used .NET Framework for a while. |
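A minimal sketch of the 'IgnoreNonSpace' workaround mentioned above, assuming it is combined with `CompareOptions.IgnoreCase` and routed through `CompareInfo`. The helper names (`SharpSWorkaround`, `EqualsLoose`, `IndexOfLoose`, `ContainsLoose`) are hypothetical; only `CompareInfo.Compare`, `CompareInfo.IndexOf` and `CompareOptions` are existing .NET APIs.

```csharp
using System;
using System.Globalization;

// Hypothetical helpers wrapping the IgnoreNonSpace workaround.
public static class SharpSWorkaround
{
    // Assumption: case and nonspacing marks are both ignored.
    private const CompareOptions Options =
        CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;

    // Culture-aware equality using the relaxed options.
    public static bool EqualsLoose(string a, string b, CultureInfo culture) =>
        culture.CompareInfo.Compare(a, b, Options) == 0;

    // Culture-aware search using the same options; returns -1 when not found.
    public static int IndexOfLoose(string source, string value, CultureInfo culture) =>
        culture.CompareInfo.IndexOf(source, value, Options);

    // Contains expressed in terms of IndexOfLoose.
    public static bool ContainsLoose(string source, string value, CultureInfo culture) =>
        IndexOfLoose(source, value, culture) >= 0;
}
```

A call such as `SharpSWorkaround.EqualsLoose("ß", "SS", CultureInfo.CurrentCulture)` is intended to match on both ICU and NLS. Note the trade-off: `IgnoreNonSpace` also ignores diacritics, so strings such as "a" and "ä" compare equal under these options, which may be broader than what a plain case-insensitive comparison is meant to do.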
Description
With the switch to the ICU library string comparisons do not work as expected. The behavior also differs from what ICU should generally return.
In a case-insensitive comparison, SS and ß do not compare equal. According to the current Unicode case folding rules they should be equal, as I understand it:
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
(CaseFolding.txt)
Also, checking with the ICU Unicode String Comparison, it says that the result should be equal.
It seems to me that somehow the specific ICU library used with .NET 6/8 has a bug or that it is used incorrectly.
See also https://stackoverflow.com/questions/78371156/%c3%9f-ss-for-case-insensitive-comparison-with-icu and https://stackoverflow.com/questions/78364649/why-does-%c3%9f-equalsss-stringcomparison-currentcultureignorecase-differ-betw
Reproduction Steps
"ß".Equals("SS", StringComparison.CurrentCultureIgnoreCase); // returns false when using ICU library
// The same is true for Contains and IndexOf
Expected behavior
The above code should return true.
Actual behavior
The above code returns false when the ICU library is used. With System.Globalization.UseNls set to true, the code returns true as expected.
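Because the result depends on which globalization backend is active, it can help to confirm the backend at runtime before comparing results across machines. The sketch below follows the check I recall from Microsoft's "Globalization and ICU" documentation, which inspects the invariant culture's `SortVersion`; treat it as an assumption to verify against that page, and the class name `GlobalizationBackend` is hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical wrapper class; the check itself relies on CompareInfo.Version.
public static class GlobalizationBackend
{
    // Returns true when the sort version looks like an ICU version,
    // i.e. the SortId bytes encode the same non-zero value as FullVersion.
    public static bool IsIcuMode()
    {
        SortVersion sortVersion = CultureInfo.InvariantCulture.CompareInfo.Version;
        byte[] bytes = sortVersion.SortId.ToByteArray();
        int version = bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0];
        return version != 0 && version == sortVersion.FullVersion;
    }
}
```

Logging `GlobalizationBackend.IsIcuMode()` next to the comparison result makes it easier to tell whether a difference comes from ICU versus NLS or from something else in the environment.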
Regression?
A regression when compared to any version before .NET 5 on Windows.
Known Workarounds
Specify
<RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
in the project file. Sadly this only works on Windows and does not help on other platforms.
Configuration
Dotnet SDK: 8.0.104
OS: Windows 11 22H2 (22621.3447)
Architecture: x64
Other information
No response