Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests to ensure Unicode string comparison is by codepoint #570

Open
Julian opened this issue Jul 5, 2022 · 8 comments
Open

Add tests to ensure Unicode string comparison is by codepoint #570

Julian opened this issue Jul 5, 2022 · 8 comments
Labels
good first issue An issue that is a good candidate for new contributors missing test A request to add a test to the suite that is currently not covered elsewhere.

Comments

@Julian
Copy link
Member

Julian commented Jul 5, 2022

§4.2.2 of the specification says, for string equality, that it is defined by:

[...] both are strings, and are the same codepoint-for-codepoint; [...]

We should add some tests to specifically ensure that a particular grapheme (e.g. ä) is not equal to the same grapheme represented by different code point(s).

@Julian Julian added missing test A request to add a test to the suite that is currently not covered elsewhere. good first issue An issue that is a good candidate for new contributors labels Jul 5, 2022
@karenetheridge
Copy link
Member

Also, that two graphemes that look the same aren't identical (for example µ (MICRO SIGN, U+00B5) is not the same as μ (GREEK SMALL LETTER MU, U+2192).

both are strings, and are the same codepoint-for-codepoint

This is actually kind of unfortunate wording, as there can be two graphemes that are identical but differ by codepoint, because of the ordering of the combining characters (underlining, diacritics, accent marks). See http://www.unicode.org/reports/tr15/ on normalization forms - ideally we should specify that string comparisons are done on normalized strings, but I fear the overhead may be prohibitive to some implementations (especially when "most" strings are ascii anyway).

@awwright
Copy link
Member

as there can be two graphemes that are identical but differ by codepoint, because of the ordering of the combining characters

I wrote the "codepoint-for-codepoint" text, in my basic understanding, there's multiple different ways to compare strings for equality, and preserving codepoints exactly is the only strategy that doesn't lose information. There's a potential that picking any other comparison algorithm could break compatibility with JSON.

Since JSON is Unicode, you can renormalize the whole document before validating it, and JSON Schema doesn't need to (normatively) mention this. You can also write your schema to use a particular normal form, and you tell applications "this schema expects [e.g.] NFC."

@zaplapl
Copy link

zaplapl commented Mar 31, 2024

I am looking for a first issue - and happy to have a crack at this one.

Would it be acceptable to put together a few cases based on the const keyword from draft 6?

Something like:

[
    {
        "description": "mu equality",
        "schema": {
            "$schema": "https://json-schema.org/draft/2019-09/schema",
            "const": "μ"
        },
        "tests": [
            {
                "description": "mu is valid",
                "data": "μ",
                "valid": true
            },
            {
                "description": "micro is invalid",
                "data": "µ",
                "valid": false
            },
            {
                "description": "mu code point is valid",
                "data": "\u03bc",
                "valid": true
            },
            {
                "description": "micro code point is invalid",
                "data": "\u00b5",
                "valid": false
            }
        ]
     }
  ]

@Julian
Copy link
Member Author

Julian commented Apr 1, 2024

I am looking for a first issue - and happy to have a crack at this one.

Thanks! That'd be really helpful!

Yep that looks good I think (other than better to mention we're testing codepoint equivalence than to mention mu in the description of the test case, that's not particularly relevant).

If we want to be extra nice probably having a second example for combining marks like the original description, i.e. "a\u0308" != "\u00e4".

@zaplapl
Copy link

zaplapl commented Apr 2, 2024

OK cool - thanks @Julian - point taken

I had a second thought after I sent the comment though

is "data" in the test schema meant to be a stand-in for JSON instance data?

basically what I'm getting at is that "\u03bc" is a perfectly valid string literal and that maybe we need to differentiate between cases where the characters that comprise a codepoint representation are being sent over the wire (or whatever) as opposed to the character identified by the codepoint

and I'm not sure whether the last two tests are correct or not

will have a think & a read and raise a PR in due course

comments appreciated in the meantime though

@Julian
Copy link
Member Author

Julian commented Apr 2, 2024

is "data" in the test schema meant to be a stand-in for JSON instance data?

Yes, data means "the instance you're testing". IIRC it predates us commonly using the word "instance" which is why it's named that way.

It's the JSON parser that someone uses's job to turn "\u03bc" into "μ" as part of deserializing (as opposed to JSON Schema), so really there isn't likely to be any difference there, but it's fine/good I think to have both representations.

@zaplapl
Copy link

zaplapl commented Apr 14, 2024

Opened a PR

the tests are on draft6 since this is the earliest draft of the spec with the 'codepoint-for-codepoint' language.

I'm not proposing to add the tests for the suites for drafts 3 & 4 since those drafts just mention that strings should have the "same value" - and don't specify how those values are derived

but if these pass the smell test I'd be happy to copy the tests into the test suites for the other draft specs - I just didn't want to crowd the PR with too much duplicate text straight off the bat

@gregsdennis
Copy link
Member

It's the JSON parser that someone uses's job to turn "\u03bc" into "μ"

I think this is the key. Since a parser is already doing the conversion, any tests we write are checking the parser, not the validator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue An issue that is a good candidate for new contributors missing test A request to add a test to the suite that is currently not covered elsewhere.
Projects
None yet
Development

No branches or pull requests

5 participants