fix(marshal)!: compare strings by codepoint #2008

erights · 2024-01-25T21:05:35Z

closes: #2113
refs: #2002

Description

JavaScript's relational comparison operators like < compare strings by lexicographic UTF16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, compareRank and associated functions compared strings using this JavaScript-native comparison. Now compareRank and associated functions compare strings by lexicographic Unicode Code Point order. This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits.
- This release does not change the encodePassable encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. encodePassable is rank-order preserving when the encoded strings are compared using compareRank.
- The key order of strings defined by the @endo/patterns module is still defined to be the same as the rank ordering of those strings. So this release changes key order among strings to also be lexicographic comparison of Unicode Code Points. To accommodate this change, you may need to adapt applications that relied on key-order being the same as JS native order. This could include the use of any patterns expressing key inequality tests, like M.gte(string).
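
To make the difference concrete, here is a minimal sketch of a code point comparison next to the native `<` operator. The function body is an illustrative assumption, not the implementation in this PR; only the name `compareByCodePoints` is borrowed from the review discussion.

```javascript
// Illustrative sketch only; NOT the actual @endo/marshal implementation.
// Walks both strings one code point at a time using the string iterator.
const compareByCodePoints = (left, right) => {
  const leftIter = left[Symbol.iterator]();
  const rightIter = right[Symbol.iterator]();
  for (;;) {
    const { value: l, done: lDone } = leftIter.next();
    const { value: r, done: rDone } = rightIter.next();
    if (lDone && rDone) return 0;
    if (lDone) return -1; // left is a proper prefix of right
    if (rDone) return 1; // right is a proper prefix of left
    const diff = l.codePointAt(0) - r.codePointAt(0);
    if (diff !== 0) return Math.sign(diff);
  }
};

const bmp = '\uFF5E'; // U+FF5E, one code unit: 0xFF5E
const supplementary = '\u{1F600}'; // U+1F600, two code units: 0xD83D 0xDE00

// Native order compares 0xFF5E with the lead surrogate 0xD83D.
console.log(bmp < supplementary); // false
// Code point order compares 0xFF5E with 0x1F600.
console.log(compareByCodePoints(bmp, supplementary)); // -1
```

The two orders disagree only when a supplementary character meets a BMP character in the range U+E000 to U+FFFF at the first point of difference, which is why most strings are unaffected.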

Security Considerations

The fact that the string ordering is closer to the Unicode semantics of the strings probably minimizes some surprises in ways that help security. OTOH, this difference from JS native string ordering probably causes other surprises that hurt security. Altogether, we do not expect much effect.

Scaling Considerations

As a comparison written in JS, this will be slower than the JS native string comparison. On XS at least, we expect to have a native code point comparison function available eventually. Altogether, we do not expect much effect.

Documentation Considerations

Most developers will not care. But it needs to be explained somewhere carefully so that developers that do care can easily find out.

Testing Considerations

@gibson042 , in a later PR, could you expand the property-based-testing to generate test cases sensitive to this change?

Compatibility Considerations

These string ordering changes bring Endo into conformance with any string ordering components of the OCapN standard.
To accommodate these changes, you may need to adapt applications that relied on rank-order or key-order being the same as JS native order. You may need to re-sort any data that had previously been rank sorted using the prior compareRank function. You may need to revisit any use of patterns like M.gte(string) expressing inequalities over strings.
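
As a rough sketch of what such an adaptation might look like, the following checks whether data sorted under the old native order is still sorted under code point order, and re-sorts it if not. The comparator, the `isSortedBy` helper, and the `stored` sample data are all hypothetical names introduced for illustration; real code would presumably use `compareRank` from the package itself.

```javascript
// Illustrative comparator; NOT the actual @endo/marshal implementation.
const compareByCodePoints = (left, right) => {
  const l = Array.from(left, ch => ch.codePointAt(0));
  const r = Array.from(right, ch => ch.codePointAt(0));
  const len = Math.min(l.length, r.length);
  for (let i = 0; i < len; i += 1) {
    if (l[i] !== r[i]) return l[i] < r[i] ? -1 : 1;
  }
  return Math.sign(l.length - r.length);
};

// True when every adjacent pair is in non-descending order under `compare`.
const isSortedBy = (items, compare) =>
  items.every((item, i) => i === 0 || compare(items[i - 1], item) <= 0);

// Hypothetical stored data, sorted earlier by JS native (code unit) order.
const stored = ['\uFF5E', 'a', '\u{1F600}'].sort();

// Re-sort only if the old order disagrees with code point order.
const needsResort = !isSortedBy(stored, compareByCodePoints);
const resorted = needsResort ? [...stored].sort(compareByCodePoints) : stored;
```

Data containing no supplementary characters passes the `isSortedBy` check unchanged, so a scan like this can also estimate how much existing data a migration would actually touch.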

Upgrade Considerations

If we currently have any persistent data, especially on chain, sorted according to JS native order (UTF16 code unit), then we cannot accept this PR until we have a plan to re-sort that data, or somehow continue to live with mis-sorted data. (Historical note: This is how Oracle came to permanently rely on UTF16 code unit order, because of the impracticality of re-sorting all that data.)

Includes *BREAKING*: in the commit message with migration instructions for any breaking change.
Updates NEWS.md for user-facing changes.

kriskowal

Excellent.

For this change, I do not think we can avoid the breaking change marker. That might render my argument for leaving it out of pass-style, moot.

erights · 2024-01-26T01:47:28Z

> Excellent.
>
> For this change, I do not think we can avoid the breaking change marker. That might render my argument for leaving it out of pass-style, moot.

Let me be sure I understand:

You're saying that this PR should keep the "!". Given that, we may as well keep the "!" on #2002 as well. Right?

kriskowal · 2024-01-26T02:01:34Z

Yes


erights · 2024-01-29T05:43:54Z

Just noting here for curiosity. From the UTF16 portion of https://icu-project.org/docs/papers/utf16_code_point_order.html:

> This opens the door for a "fix-up" of code unit values that is faster than assembling 21-bit code point values.

OMG
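
One way to read the "fix-up" idea is sketched below: compare code units directly, but at the first differing position remap them so that surrogates sort above every other BMP code unit. This is an interpretation under stated assumptions, not code from this PR or from ICU; `fixUp` and `compareByFixedUpCodeUnits` are hypothetical names, and the sketch assumes well-formed strings (with lone surrogates its result can differ from a comparison over iterated code points).

```javascript
// Remap one UTF-16 code unit so surrogates sort after all other BMP units.
const fixUp = unit => {
  if (unit < 0xd800) return unit; // below the surrogate range: unchanged
  if (unit >= 0xe000) return unit - 0x800; // U+E000..U+FFFF shift down
  return unit + 0x2000; // surrogates shift up to 0xF800..0xFFFF
};

// Assumes both strings are well-formed (no lone surrogates).
const compareByFixedUpCodeUnits = (left, right) => {
  const len = Math.min(left.length, right.length);
  for (let i = 0; i < len; i += 1) {
    const l = left.charCodeAt(i);
    const r = right.charCodeAt(i);
    // Only the first differing code unit decides the result.
    if (l !== r) return fixUp(l) < fixUp(r) ? -1 : 1;
  }
  return Math.sign(left.length - right.length);
};

console.log(compareByFixedUpCodeUnits('\uFF5E', '\u{1F600}')); // -1
```

The appeal is that no 21-bit code point is ever assembled: the common path is a plain code unit comparison, and the remapping runs at most once per comparison.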

erights · 2024-01-29T07:18:58Z

packages/marshal/NEWS.md

+# next release
+
+- JavaScript's relational comparison operators like `<` compare strings by lexicographic UTF16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareRank` and associated functions compared strings using this JavaScript-native comparison. Now `compareRank` and associated functions compare strings by lexicographic Unicode Code Point order. ***This change only affects strings containing so-called supplementary characters, i.e., those whose Unicode character code does not fit in 16 bits***.
+  - This release does not change the `encodePassable` encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. `encodePassable` is rank-order preserving when the encoded strings are compared using `compareRank`.


@gibson042 is this true? It was true for my small test case, which proves very little. Will the same property also be true for compactOrdered? For either, does restricting these strings to well-ordered have any effect on whether their encoding is order preserving?

It is true now, but I think that's a mistake... recordNames and any similar function that .sort()s an array of strings in marshal or a related package should probably be updated to .sort(compareByCodePoints) so the encoding of Copy{Record,Set,Bag,Map}s and their own comparison is consistent with that of their constituent strings.

Which unfortunately complicates adoption if we have existing use of any such strings.

Good point, and bad news!

Grepping for .sort() specifically with nothing between the parens, I see 96 occurrences in agoric-sdk and 26 in endo. Some may not be or contain strings. But still, fixing all that do will be disruptive. And the longer we wait, the more disruptive it'll be.

I'm putting this back into Draft until we decide what our plan is. Attn @ivanlei

Is there any practical way to scan a recent snapshot of our chain and somehow see how many persistent strings

- are non-ascii,
- are non-well-formed, or
- have supplementary characters (those whose code does not fit in 16 bits)?

How hard would it be?
Attn @mhofman

NOT URGENT.
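
For the classification step of such a scan, a minimal sketch might look like the following. It only classifies strings already in hand; how to extract persistent strings from a chain snapshot is outside its scope. `classifyString` and `tallyStrings` are hypothetical names introduced here for illustration.

```javascript
// Classify one string along the three axes asked about above.
const classifyString = str => {
  let nonAscii = false;
  let illFormed = false;
  let supplementary = false;
  // The string iterator yields surrogate pairs as one unit and
  // lone surrogates as single code units.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp > 0x7f) nonAscii = true;
    if (cp >= 0xd800 && cp <= 0xdfff) illFormed = true; // lone surrogate
    if (cp > 0xffff) supplementary = true;
  }
  return { nonAscii, illFormed, supplementary };
};

// Count how many strings fall into each category.
const tallyStrings = strings => {
  const counts = { total: 0, nonAscii: 0, illFormed: 0, supplementary: 0 };
  for (const str of strings) {
    const { nonAscii, illFormed, supplementary } = classifyString(str);
    counts.total += 1;
    if (nonAscii) counts.nonAscii += 1;
    if (illFormed) counts.illFormed += 1;
    if (supplementary) counts.supplementary += 1;
  }
  return counts;
};

console.log(tallyStrings(['abc', '\u00E9', '\u{1F600}', '\uD83D']));
```

Only the `supplementary` and `illFormed` counts bear on this PR, since strings outside those categories sort identically under both orders.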

kriskowal

Thanks for the detailed NEWS.md.

packages/marshal/src/rankOrder.js

packages/marshal/test/test-encodePassable.js

packages/marshal/NEWS.md

gibson042 · 2024-01-30T00:35:19Z

packages/marshal/NEWS.md


It is true now, but I think that's a mistake... recordNames and any similar function that .sort()s an array of strings in marshal or a related package should probably be updated to .sort(compareByCodePoints) so the encoding of Copy{Record,Set,Bag,Map}s and their own comparison is consistent with that of their constituent strings.

Which unfortunately complicates adoption if we have existing use of any such strings.

packages/marshal/src/rankOrder.js

erights requested a review from gibson042 January 25, 2024 21:05

erights self-assigned this Jan 25, 2024

erights force-pushed the markm-rank-strings-by-codepoint branch from 639c3e3 to 0c5d518 Compare January 25, 2024 21:10

kriskowal reviewed Jan 26, 2024

erights force-pushed the markm-rank-strings-by-codepoint branch from 0c5d518 to 4824f1c Compare January 26, 2024 01:50

erights mentioned this pull request Jan 26, 2024

feat(pass-style): feature flag: only well-formed strings are passable #2002

Merged

1 task

erights force-pushed the markm-rank-strings-by-codepoint branch 3 times, most recently from 1312009 to 7a3a43a Compare January 29, 2024 07:15

erights commented Jan 29, 2024

erights force-pushed the markm-rank-strings-by-codepoint branch from 7a3a43a to 12b6ebe Compare January 29, 2024 07:32

erights requested review from dckc and ivanlei January 29, 2024 07:47

erights marked this pull request as ready for review January 29, 2024 07:47

erights requested a review from kriskowal January 29, 2024 07:47

kriskowal approved these changes Jan 30, 2024

packages/marshal/src/rankOrder.js

packages/marshal/test/test-encodePassable.js

gibson042 reviewed Jan 30, 2024

erights marked this pull request as draft January 30, 2024 04:15

erights force-pushed the markm-rank-strings-by-codepoint branch from fd874a2 to b6580c5 Compare February 3, 2024 02:07

dckc mentioned this pull request Feb 12, 2024

test(marshal,exo): stop exporting from test-*.js files #2053

Merged

2 tasks

erights mentioned this pull request Mar 3, 2024

Must compare strings by codepoint instead of codeunit #2113

Open

erights force-pushed the markm-rank-strings-by-codepoint branch from b6580c5 to 84b8abd Compare March 13, 2024 01:10

erights force-pushed the markm-rank-strings-by-codepoint branch 2 times, most recently from 567b23c to 3a74c9c Compare March 24, 2024 18:11

dckc removed their request for review March 25, 2024 19:51

erights force-pushed the markm-rank-strings-by-codepoint branch 2 times, most recently from 939657b to 8d8375e Compare April 8, 2024 22:03

erights force-pushed the markm-rank-strings-by-codepoint branch from 8d8375e to c2a3302 Compare April 14, 2024 20:03

erights force-pushed the markm-rank-strings-by-codepoint branch 3 times, most recently from 219fa1b to 7291ff7 Compare April 30, 2024 20:18

erights force-pushed the markm-rank-strings-by-codepoint branch 4 times, most recently from fa55986 to 70402a0 Compare May 7, 2024 21:10

fix(marshal)!: compare strings by codepoint

01394d3

erights force-pushed the markm-rank-strings-by-codepoint branch from 70402a0 to 01394d3 Compare May 9, 2024 00:08

erights added 2 commits May 10, 2024 20:23

Merge branch 'master' into markm-rank-strings-by-codepoint

44b1665

Merge branch 'master' into markm-rank-strings-by-codepoint

9e049f3
