Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(marshal)!: compare strings by codepoint #2008

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

erights
Copy link
Contributor

@erights erights commented Jan 25, 2024

closes: #2113
refs: #2002

Description

  • JavaScript's relational comparison operators like < compare strings by lexicographic UTF16 code unit order, which is exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, compareRank and associated functions compared strings using this JavaScript-native comparison. Now compareRank and associated functions compare strings by lexicographic Unicode Code Point order. This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits.
    • This release does not change the encodePassable encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. encodePassable is rank-order preserving when the encoded strings are compared using compareRank.
    • The key order of strings defined by the @endo/patterns module is still defined to be the same as the rank ordering of those strings. So this release changes key order among strings to also be lexicographic comparison of Unicode Code Points. To accommodate this change, you may need to adapt applications that relied on key-order being the same as JS native order. This could include the use of any patterns expressing key inequality tests, like M.gte(string).

Security Considerations

The fact that the string ordering is closer to the Unicode semantics of the strings probably minimizes some surprises in ways that help security. OTOH, this difference from JS native string ordering probably causes other surprises that hurt security. Altogether, we do not expect much effect.

Scaling Considerations

As a comparison written in JS, will be slower that the JS native string comparison. On XS at least, we expect to have a native code point comparison function available eventually. Altogether, we do not expect much effect.

Documentation Considerations

Most developers will not care. But it needs to be explained somewhere carefully so that developers that do care can easily find out.

Testing Considerations

@gibson042 , in a later PR, could you expand the property-based-testing to generate test cases sensitive to this change?

Compatibility Considerations

  • These string ordering changes brings Endo into conformance with any string ordering components of the OCapN standard.
  • To accommodate these change, you may need to adapt applications that relied on rank-order or key-order being the same as JS native order. You may need to resort any data that had previously been rank sorted using the prior compareRank function. You may need to revisit any use of patterns like M.gte(string) expressing inequalities over strings.

Upgrade Considerations

If we currently have any persistent data, especially on chain, sorted according to JS native order (UTF16 code unit), then we cannot accept this PR until we have a plan to resort that data, or somehow continue to live with mis-sorted. (Historical note: This is how Oracle came to permanently rely on UTF16 code unit order, because of the impracticality of resorting all that data.)

  • Includes *BREAKING*: in the commit message with migration instructions for any breaking change.
  • Updates NEWS.md for user-facing changes.

@erights erights self-assigned this Jan 25, 2024
@erights erights force-pushed the markm-rank-strings-by-codepoint branch from 639c3e3 to 0c5d518 Compare January 25, 2024 21:10
Copy link
Member

@kriskowal kriskowal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent.

For this change, I do not think we can avoid the breaking change marker. That might render my argument for leaving it out of pass-style, moot.

@erights
Copy link
Contributor Author

erights commented Jan 26, 2024

Excellent.

For this change, I do not think we can avoid the breaking change marker. That might render my argument for leaving it out of pass-style, moot.

Let me be sure I understand:

You're saying that this PR should keep the "!". Given that, we may as well keep the "!" on #2002 as well. Right?

@erights erights force-pushed the markm-rank-strings-by-codepoint branch from 0c5d518 to 4824f1c Compare January 26, 2024 01:50
@kriskowal
Copy link
Member

kriskowal commented Jan 26, 2024 via email

@erights
Copy link
Contributor Author

erights commented Jan 29, 2024

Just noting here for curiosity. In the UTF16 portion of https://icu-project.org/docs/papers/utf16_code_point_order.html

This opens the door for a "fix-up" of code unit values that is faster than assembling 21-bit code point values.

OMG

@erights erights force-pushed the markm-rank-strings-by-codepoint branch 3 times, most recently from 1312009 to 7a3a43a Compare January 29, 2024 07:15
# next release

- JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF16 code unit order, which is exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareRank` and associated functions compared strings using this JavaScript-native comparison. Now `compareRank` and associated functions compare strings by lexicographic Unicode Code Point order. ***This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits***.
- This release does not change the `encodePassable` encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. `encodePassable` is rank-order preserving when the encoded strings are compared using `compareRank`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gibson042 is this true? It was true for my small test case, which proves very little. Will the same property also be true for compactOrdered? For either, does restricting these strings to well-ordered have any effect on whether their encoding is order preserving?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true now, but I think that's a mistake... recordNames and any similar function that .sort()s an array of strings in marshal or a related package should probably be updated to .sort(compareByCodePoints) so the encoding of Copy{Record,Set,Bag,Map}s and their own comparison is consistent with that of their constituent strings.

Which unfortunately complicates adoption if we have existing use of any such strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, and bad news!

Grepping for .sort() specifically with nothing between the parens, I see 96 occurrences in agoric-sdk and 26 in endo. Some may not be or contain strings. But still, fixing all that do will be disruptive. And the longer we wait, the more disruptive it'll be.

I'm putting this back into Draft until we decide what our plan is. Attn @ivanlei

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any practical way to scan a recent snapshot of our chain and somehow see how many persistent strings are

  • non-ascii,
  • non-well-formed, or
  • have supplementary characters (those whose code is > 16 bits)

?
How hard would it be?
Attn @mhofman

NOT URGENT.

@erights erights force-pushed the markm-rank-strings-by-codepoint branch from 7a3a43a to 12b6ebe Compare January 29, 2024 07:32
@erights erights requested review from dckc and ivanlei January 29, 2024 07:47
@erights erights marked this pull request as ready for review January 29, 2024 07:47
Copy link
Member

@kriskowal kriskowal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed NEWS.md.

packages/marshal/src/rankOrder.js Show resolved Hide resolved
packages/marshal/test/test-encodePassable.js Outdated Show resolved Hide resolved
packages/marshal/NEWS.md Outdated Show resolved Hide resolved
# next release

- JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF16 code unit order, which is exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareRank` and associated functions compared strings using this JavaScript-native comparison. Now `compareRank` and associated functions compare strings by lexicographic Unicode Code Point order. ***This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits***.
- This release does not change the `encodePassable` encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. `encodePassable` is rank-order preserving when the encoded strings are compared using `compareRank`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true now, but I think that's a mistake... recordNames and any similar function that .sort()s an array of strings in marshal or a related package should probably be updated to .sort(compareByCodePoints) so the encoding of Copy{Record,Set,Bag,Map}s and their own comparison is consistent with that of their constituent strings.

Which unfortunately complicates adoption if we have existing use of any such strings.

packages/marshal/src/rankOrder.js Show resolved Hide resolved
@erights erights marked this pull request as draft January 30, 2024 04:15
@erights erights force-pushed the markm-rank-strings-by-codepoint branch from fd874a2 to b6580c5 Compare February 3, 2024 02:07
@erights erights force-pushed the markm-rank-strings-by-codepoint branch from b6580c5 to 84b8abd Compare March 13, 2024 01:10
@erights erights force-pushed the markm-rank-strings-by-codepoint branch 2 times, most recently from 567b23c to 3a74c9c Compare March 24, 2024 18:11
@dckc dckc removed their request for review March 25, 2024 19:51
@erights erights force-pushed the markm-rank-strings-by-codepoint branch 2 times, most recently from 939657b to 8d8375e Compare April 8, 2024 22:03
@erights erights force-pushed the markm-rank-strings-by-codepoint branch from 8d8375e to c2a3302 Compare April 14, 2024 20:03
@erights erights force-pushed the markm-rank-strings-by-codepoint branch 3 times, most recently from 219fa1b to 7291ff7 Compare April 30, 2024 20:18
@erights erights force-pushed the markm-rank-strings-by-codepoint branch 4 times, most recently from fa55986 to 70402a0 Compare May 7, 2024 21:10
@erights erights force-pushed the markm-rank-strings-by-codepoint branch from 70402a0 to 01394d3 Compare May 9, 2024 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Must compare strings by codepoint instead of codeunit
4 participants