Add tests to ensure Unicode string comparison is by codepoint #570

Julian · 2022-07-05T12:42:07Z

§4.2.2 of the specification says, for string equality, that it is defined by:

[...] both are strings, and are the same codepoint-for-codepoint; [...]

We should add some tests to specifically ensure that a particular grapheme (e.g. ä) is not equal to the same grapheme represented by different code point(s).

The text was updated successfully, but these errors were encountered:

karenetheridge · 2022-08-29T00:58:40Z

Also, that two graphemes that look the same aren't identical (for example µ (MICRO SIGN, U+00B5) is not the same as μ (GREEK SMALL LETTER MU, U+2192).

both are strings, and are the same codepoint-for-codepoint

This is actually kind of unfortunate wording, as there can be two graphemes that are identical but differ by codepoint, because of the ordering of the combining characters (underlining, diacritics, accent marks). See http://www.unicode.org/reports/tr15/ on normalization forms - ideally we should specify that string comparisons are done on normalized strings, but I fear the overhead may be prohibitive to some implementations (especially when "most" strings are ascii anyway).

awwright · 2022-09-28T18:34:58Z

as there can be two graphemes that are identical but differ by codepoint, because of the ordering of the combining characters

I wrote the "codepoint-for-codepoint" text, in my basic understanding, there's multiple different ways to compare strings for equality, and preserving codepoints exactly is the only strategy that doesn't lose information. There's a potential that picking any other comparison algorithm could break compatibility with JSON.

Since JSON is Unicode, you can renormalize the whole document before validating it, and JSON Schema doesn't need to (normatively) mention this. You can also write your schema to use a particular normal form, and you tell applications "this schema expects [e.g.] NFC."

zaplapl · 2024-03-31T14:36:17Z

I am looking for a first issue - and happy to have a crack at this one.

Would it be acceptable to put together a few cases based on the const keyword from draft 6?

Something like:

[
    {
        "description": "mu equality",
        "schema": {
            "$schema": "https://json-schema.org/draft/2019-09/schema",
            "const": "μ"
        },
        "tests": [
            {
                "description": "mu is valid",
                "data": "μ",
                "valid": true
            },
            {
                "description": "micro is invalid",
                "data": "µ",
                "valid": false
            },
            {
                "description": "mu code point is valid",
                "data": "\u03bc",
                "valid": true
            },
            {
                "description": "micro code point is invalid",
                "data": "\u00b5",
                "valid": false
            }
        ]
     }
  ]

Julian · 2024-04-01T12:44:21Z

I am looking for a first issue - and happy to have a crack at this one.

Thanks! That'd be really helpful!

Yep that looks good I think (other than better to mention we're testing codepoint equivalence than to mention mu in the description of the test case, that's not particularly relevant).

If we want to be extra nice probably having a second example for combining marks like the original description, i.e. "a\u0308" != "\u00e4".

zaplapl · 2024-04-02T15:08:27Z

OK cool - thanks @Julian - point taken

I had a second thought after I sent the comment though

is "data" in the test schema meant to be a stand-in for JSON instance data?

basically what I'm getting at is that "\u03bc" is a perfectly valid string literal and that maybe we need to differentiate between cases where the characters that comprise a codepoint representation are being sent over the wire (or whatever) as opposed to the character identified by the codepoint

and I'm not sure whether the last two tests are correct or not

will have a think & a read and raise a PR in due course

comments appreciated in the meantime though

Julian · 2024-04-02T15:14:47Z

is "data" in the test schema meant to be a stand-in for JSON instance data?

Yes, data means "the instance you're testing". IIRC it predates us commonly using the word "instance" which is why it's named that way.

It's the JSON parser that someone uses's job to turn "\u03bc" into "μ" as part of deserializing (as opposed to JSON Schema), so really there isn't likely to be any difference there, but it's fine/good I think to have both representations.

zaplapl · 2024-04-14T11:01:47Z

Opened a PR

the tests are on draft6 since this is the earliest draft of the spec with the 'codepoint-for-codepoint' language.

I'm not proposing to add the tests for the suites for drafts 3 & 4 since those drafts just mention that strings should have the "same value" - and don't specify how those values are derived

but if these pass the smell test I'd be happy to copy the tests into the test suites for the other draft specs - I just didn't want to crowd the PR with too much duplicate text straight off the bat

gregsdennis · 2024-04-14T12:19:05Z

It's the JSON parser that someone uses's job to turn "\u03bc" into "μ"

I think this is the key. Since a parser is already doing the conversion, any tests we write are checking the parser, not the validator.

Julian added missing test A request to add a test to the suite that is currently not covered elsewhere. good first issue An issue that is a good candidate for new contributors labels Jul 5, 2022

zaplapl mentioned this issue Apr 14, 2024

Add tests to ensure Unicode string comparison is by codepoint - Issue #570 #735

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tests to ensure Unicode string comparison is by codepoint #570

Add tests to ensure Unicode string comparison is by codepoint #570

Julian commented Jul 5, 2022

karenetheridge commented Aug 29, 2022

awwright commented Sep 28, 2022

zaplapl commented Mar 31, 2024

Julian commented Apr 1, 2024

zaplapl commented Apr 2, 2024

Julian commented Apr 2, 2024

zaplapl commented Apr 14, 2024

gregsdennis commented Apr 14, 2024

Add tests to ensure Unicode string comparison is by codepoint #570

Add tests to ensure Unicode string comparison is by codepoint #570

Comments

Julian commented Jul 5, 2022

karenetheridge commented Aug 29, 2022

awwright commented Sep 28, 2022

zaplapl commented Mar 31, 2024

Julian commented Apr 1, 2024

zaplapl commented Apr 2, 2024

Julian commented Apr 2, 2024

zaplapl commented Apr 14, 2024

gregsdennis commented Apr 14, 2024