Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add note to Unicode unaware methods #9339

Merged
merged 1 commit into from Dec 8, 2020

Conversation

martijnhoekstra
Copy link
Contributor

@martijnhoekstra martijnhoekstra commented Nov 23, 2020

Some methods in StringOps are Unicode unaware and likely to return invalid strings if they are used without keeping surrogate pair handling in mind. These were previously deprecated, but were undeprecated again because they are useful in various contexts where unicode handling is not required.

I've added notes on those methods that return strings, and Unicode unaware handling by the user may return invalid strings. I've not added notes on methods that don't return String or methods that won't return invalid strings if the user never returns or specifies part of a surrogate pair.

The deprecation was #6702, undeprecation was #9246, bug report scala/bug#10881

The note is added to the originally deprecated methods diff, intersect, distinct, distinctBy, sorted, sortWith, sortBy, groupBy, sliding, combinations and permutations, as well as methods that change strings based on indices. Those where you may get it wrong for Unicode strings if you're doing naive index arithmetic around a surrogate pair: patch, slice, updated, tail and init, take, drop, takeRight , dropRight, and splitAt, and those that will always lead to invalid Unicode sequences if surrogate pairs are present: inits, tails, mkString and reverse

@martijnhoekstra martijnhoekstra added the documentation No code change. Only documentation label Nov 23, 2020
@scala-jenkins scala-jenkins added this to the 2.13.5 milestone Nov 23, 2020
@martijnhoekstra
Copy link
Contributor Author

arguably, the current implementation of mkString is a bug and should just be fixed to work on code points instead of Char's. I have difficulty imagining any scenario where someone depends on splitting up surrogate pairs in the current implementation.

Copy link
Member

@lrytz lrytz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

src/library/scala/collection/StringOps.scala Show resolved Hide resolved
@SethTisue
Copy link
Member

squash please, and then let's merge

@som-snytt
Copy link
Contributor

Should mention that "unicode unaware" is not the opposite of "icode aware".

Some methods in StringOps are unicode unaware and likely to
return invalid strings if they are used without keeping
surrogate pair handling in mind. These were previously
deprecated, but were undeprecated again because they are
useful in various contexts where unicode handling is not
required.
@martijnhoekstra
Copy link
Contributor Author

martijnhoekstra commented Dec 6, 2020

squashed, unfortunately without icode or further double negation puns

@SethTisue SethTisue merged commit 1c085bd into scala:2.13.x Dec 8, 2020
@SethTisue
Copy link
Member

🎵 "CHICKEN".mkString("-"), that's the way you spell “chicken” 🎵

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation No code change. Only documentation
Projects
None yet
5 participants