Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure non-ASCII decimal digits are also isdigit #54447

Closed
wants to merge 1 commit into from

Conversation

Seelengrab
Copy link
Contributor

@Seelengrab Seelengrab commented May 12, 2024

I noticed that

julia> '' |> isdigit
false

even though

julia> ''
'': Unicode U+FF14 (category Nd: Number, decimal digit)

The docstring doesn't say that this only checks ASCII, and the unicode category (see here) is an exact match for the definition (ten glyphs making up the numbers zero through nine, in various scripts). So while some things will still slip through (e.g. , which is considered a letter by Unicode), this implementation is a tiny bit more correct.

However, it's not just roses and sunshine; there is a downside, in that this won't vectorize anymore due to the ccall behind category_code. We could bring the double table lookup into pure Julia, but I'm not sure whether that would restore performance here. Maybe it could? As is, this PR comes with a 100x regression in a microbenchmark.

v1.11-alpha2:

julia> @benchmark count(isdigit, v) setup=(v=rand(Char, 10_000))
BenchmarkTools.Trial: 10000 samples with 202 evaluations.
 Range (min  max):  392.871 ns  509.455 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     400.743 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   405.332 ns ±  12.212 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▅█▆▇▆▄▂                  ▁▁   ▁▃▂▂▂                     ▂
  ▆▂▅▅██████████▇▆▅▄▄▄▄▂▄▄▄▅▅▅▆▆▇██▇▇▆███████▇▇▇▆▇▆▆▇▆▆▆▇█████▇ █
  393 ns        Histogram: log(frequency) by time        446 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

This PR:

julia> @benchmark count(isdigit, v) setup=(v=rand(Char, 10_000))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  40.870 μs  483.089 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     42.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.848 μs ±  19.561 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▃▃▂                                                        ▁
  ████████▆▆▄▄▆▄▅▅▅▄▅▅▇▄▄▄▄▄▁▁▁▃▁▃▁▁▁▁▁▁▁▁▃▇██▇▅▆▃▆█▄▄▄▃▁▁▁▁▁█ █
  40.9 μs       Histogram: log(frequency) by time       129 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

In a single-invocation benchmark, we "merely" have a 2-3x/10ns regression:

v1.11-alpha2:

julia> @benchmark isdigit(v) setup=(v=rand(Char)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  10.000 ns  40.000 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     20.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.548 ns ±  4.491 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                            █  
  ▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▂
  10 ns           Histogram: frequency by time          20 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

This PR:

julia> @benchmark isdigit(v) setup=(v=rand(Char)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  20.000 ns  520.000 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     30.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.170 ns ±  31.893 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇  █  ▂                      ▂   ▂                           ▁
  █▁▁█▁▁█▁▁▇▁▁▁▅▁▁▄▁▁▁▁▁▁▁▁▁▆▁▁█▁▁▁█▁▁█▁▁▅▁▁▁▄▁▁▃▁▁▄▁▁▁▃▁▁▇▁▁▅ █
  20 ns         Histogram: log(frequency) by time       200 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

That being said, we already check category_code pretty much everywhere else, and this microbenchmark is extremely unlikely to be representative of an actual workload.

@Seelengrab Seelengrab added the domain:unicode Related to unicode characters and encodings label May 12, 2024
@jakobnissen
Copy link
Contributor

The current (ASCII only) definition of isdigit is used a few places in Base and stdlibs. If this PR should land, then the rest of Julia Base should be updated. For example, Base.TOML.isvalid_barekey_char assumes that isidigt means ASCII.
Also, there are probably other uses where the new implementation would still work, but is unnecessary.

@Seelengrab
Copy link
Contributor Author

Seelengrab commented May 12, 2024

That may be true, but on the other hand, these uses relying on ASCII-only are all undocumented assumptions :/ Worse, isdigit refers to isletter, which does specify that it's about the Unicode category. It's very reasonable to infer that this also is true for isdigit.

The failures in the Unicode stdlib are real - I only tested with test/unicode/utf8.jl from Base (I don't know why these differ). In particular, the failures are with these characters, which are a mix of decimal digits and non-digit numerics:

julia> for c in ['٣', '٥', '٨', '¹', '']
           show(stdout, MIME"text/plain"(), c)
           println()
       end
'٣': Unicode U+0663 (category Nd: Number, decimal digit)
'٥': Unicode U+0665 (category Nd: Number, decimal digit)
'٨': Unicode U+0668 (category Nd: Number, decimal digit)
'¹': Unicode U+00B9 (category No: Number, other)
'': Unicode U+2173 (category Nl: Number, letter)

Doubly worse, the test failing here is in a @testset specifically handling utf8proc predicates.. The issue referenced there also mentions isdigit, but that doesn't seem to have actually happened, judging by the PR that added the utf8proc queries.

Would adding a predicate isasciidigit be an option? It's really weird to me to have lots of utf8proc-based predicates and then a single one that's ASCII-only, while not mentioning that fact in the docstring.

@Seelengrab
Copy link
Contributor Author

Looking through all those old PRs/issue, maybe @stevengj has some additional context/guidance here :)

@nhz2
Copy link
Contributor

nhz2 commented May 12, 2024

The docs for isdigit states it only includes characters 0-9.

Tests whether a character is a decimal digit (0-9).

julia> c = ''
'': Unicode U+FF14 (category Nd: Number, decimal digit)

julia> c in '0':'9'
false

So I don't see an issue with the current behavior.

@Seelengrab
Copy link
Contributor Author

Seelengrab commented May 12, 2024

There are more than just the ASCII decimal digits in Unicode, hence the example above with a 4 from a different script ;) Both that 4 and the ASCII 4 are decimal digits, and would thus fit the docstring.

See e.g. here for a list.

@Seelengrab Seelengrab changed the title Ensure non-ASCII digits are also isdigit Ensure non-ASCII decimal digits are also isdigit May 12, 2024
@nhz2
Copy link
Contributor

nhz2 commented May 12, 2024

Yes, the docstring should be updated to be explicit about only including ASCII decimal digit then, but I have code that would break if this function returns true for other chars out of the range '0':'9'

@stevengj
Copy link
Member

If I remember, this was following the C isdigit function. I think the rationale was that isdigit checks are mostly use for parsing tasks that expect ASCII numbers.

@jakobnissen
Copy link
Contributor

jakobnissen commented May 13, 2024

I would favor @Seelengrab's proposed behaviour, for the following reasons:

  • It mirrors other functions like isletter, islowercase, isuppercase, isspace, ispunct, and isprint, which work using Unicode categories. Having isdigit only apply for ASCII is a strange and unexpected exception to the rule. The existing behaviour would be much clearer under a name like is_ascii_digit.
  • The documentation of isdigit says that it checks if a character "a decimal digit (0-9)". To me that clearly means whether it represents one of the numbers 0-9. If it was supposed to mean whether the digit was in '0':'9', then surely the documentation would have said "whether a character is a decimal digit (in the range '0':'9')."
  • Whether using the ASCII or Unicode definition is more or less useful in parsing tasks is not an important criterion IMO. Just because 80% of users only care about ASCII for specific use cases doesn't mean we don't need to also consider the 20% of the cases.

@KristofferC
Copy link
Sponsor Member

KristofferC commented May 13, 2024

The TOML parser is bugged with this:

julia> TOML.parse("4= \"value\"")
ERROR: TOML Parser error:
none:1:1 error: invalid bare key character: ''= "value"
  ^          
Stacktrace:
...

julia> @eval Base.Unicode begin
           isdigit(c::AbstractChar) = category_code(c) == UTF8PROC_CATEGORY_ND
       end;

julia> TOML.parse("4= \"value\"")
ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'', [4]=>'='
Stacktrace:
  [1] string_index_err(s::String, i::Int64)
    @ Base ./strings/string.jl:12
  [2] SubString{String}(s::String, i::Int64, j::Int64)
    @ Base ./strings/substring.jl:35
  [3] SubString
    @ ./strings/substring.jl:49 [inlined]
  [4] SubString
    @ ./strings/substring.jl:52 [inlined]
  [5] take_substring
    @ ./toml_parser.jl:440 [inlined]
  [6] _parse_key(l::Base.TOML.Parser)
    @ Base.TOML ./toml_parser.jl:623

this microbenchmark is extremely unlikely to be representative of an actual workload.

Checking for digits feels quite common inside hot parsing loops. Also, if existing uses assumed that isdigit was only true for ASCII this could potentially cause segfaults/UB since a string consisting of only digits can no longer be assumed to have a certain size (which is probably the cause of the TOML failure).

Instead of slowing down all uses of this in the ecosystem for the reason "this implementation is a tiny bit more correct" could a new function be added (and referred to from isdigit in a note)? It just doesn't seem worth modifying this pragmatically.

@Seelengrab
Copy link
Contributor Author

Seelengrab commented May 13, 2024

The TOML parser is bugged with this:

ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'4', [4]=>'='

a string consisting of only digits can no longer be assumed to have a certain size (which is probably the cause of the TOML failure).

IMO that is more an issue with assuming l.prevpos-1 is always a valid index into the String:

take_substring(l::Parser) = SubString(l.str, l.marker:(l.prevpos-1))

which isn't the case in Unicode. IMO this should use prevind instead, but that's a bit off topic here. To be fair, I don't know my way around the TOML parser so I can't tell whether that use is ordinarily ok or not 🤷

There's this bit about isvalid_barekey_char:

julia/base/toml_parser.jl

Lines 589 to 592 in 5006312

# SPEC: "Bare keys may only contain ASCII letters, ASCII digits, underscores,
# and dashes (A-Za-z0-9_-).
# Note that bare keys are allowed to be composed of only ASCII digits, e.g. 1234,
# but are always interpreted as strings."

but that seems inconsistent with the spec:

Bare keys may contain any letter-like or number-like Unicode character from any Unicode script, as well as ASCII digits, dashes and underscores.

I'm happy to add tests & fixes to Base here, but the fact that only the Unicode-stdlib tests failed suggest to me that this is either not well tested, or not that big of a change after all.

Checking for digits feels quite common inside hot parsing loops
Instead of slowing down all uses of this in the ecosystem for the reason "this implementation is a tiny bit more correct"

Having written some of those hot parsing loops myself, it's quite common to replace the category code-based functions from Base with custom ones specialized for ASCII already - what's one more? IMO, utility functionality provided by Base, working on types defined in Base, should handle the canonical interpretation of those types, and not a special case. That is, the utility functions in this case should follow what Unicode considers a decimal digit.

could a new function be added (and referred to from isdigit in a note)?

That's also an option, but then we have the odd situation that only isdigit is the odd one out with having to use a different function to check what Unicode considers a digit, when everything else already does that by default. String is a UTF-8 string after all, not an ASCIIString, and Char is a Unicode codepoint, not an ASCII character.

To be clear, the usual arithmetic people sometimes do (e.g. '0' + 1 == '1') should still work with everything Unicode considers a decimal digit, according to the category definition linked above. So if code breaks as a result of this, it either assumed it was working on ASCII already (which should already be guarded behind an isascii check or other guarantee that the input is in fact ASCII), or it wasn't correctly handling Unicode string indexing 🤔

@KristofferC
Copy link
Sponsor Member

The TOML spec is here https://toml.io/en/v1.0.0#keys

Bare keys may only contain ASCII letters, ASCII digits, underscores, and dashes (A-Za-z0-9_-). Note that bare keys are allowed to be composed of only ASCII digits, e.g. 1234, but are always interpreted as strings.


So if code breaks as a result of this, it either assumed it was working on ASCII already (which should already be guarded behind an isascii check or other guarantee that the input is in fact ASCII), or it wasn't correctly handling Unicode string indexing

As I wrote and demonstrated, there is code that assumes that isdigit does what it used to do. Making isdigit slower and changing it in such a way that it can introduce bugs by making old code have to handle new data is unlikely to be beneficial from a pragmatic point of view.

@Seelengrab
Copy link
Contributor Author

Seelengrab commented May 13, 2024

The TOML spec is here https://toml.io/en/v1.0.0#keys

Thanks for pointing that out! Seems like the version I linked is the in-development version, so a future version of TOML is likely going to support Unicode there.

As I wrote and demonstrated, there is code that assumes that isdigit does what it used to do. Making isdigit slower and changing it in such a way that it can introduce bugs by making old code have to handle new data is unlikely to be beneficial from a pragmatic point of view.

I understand that, but this is a largely undocumented assumption. @jakobnissen also found an issue with the current implementation, which doesn't handle malformed data properly:

julia> isdigit(reinterpret(Char, 0x30000001))
true

which this PR does handle, returning false. Naively, that would be much more catastrophic for the TOML parser (or any other parser, really) than just checking isascii (which, again, I'm happy to insert).

@stevengj
Copy link
Member

stevengj commented May 13, 2024

Clearly, changing isdigit to return true for non-ASCII digits would be breaking and can't happen in 1.x. In the short term we should simply document the current behavior more clearly (#54492).

The question is whether we should export a new function that returns true for any character in Unicode category Nd ("Number, decimal digit"). What applications would need this functionality?

It seems more useful, and more general, to simply export some variant of Base.Unicode.category_code, maybe returning an enum, so that users can test for whatever category they want.

@stevengj stevengj added kind:breaking This change will break code kind:speculative Whether the change will be implemented is speculative labels May 13, 2024
@stevengj stevengj added this to the Potential 2.0 milestone May 13, 2024
@inkydragon
Copy link
Sponsor Member

whether we should export a new function that returns true for any character in Unicode category Nd ("Number, decimal digit").

It seems we already have it: isnumeric

  isnumeric(c::AbstractChar) -> Bool

  Tests whether a character is numeric. A character is classified as numeric if it belongs to the Unicode general category Number, i.e. a character whose category code begins with 'N'.

  Note that this broad category includes characters such as ¾ and ௰. Use isdigit to check whether a character a decimal digit between 0 and 9.

anumber=['0', '1', '5', '9']
unumber=['٣', '٥', '٨', '¹', '' ]
for c in anumber
@test isdigit(c) == true
@test isnumeric(c) == true
end
for c in unumber
@test isdigit(c) == false
@test isnumeric(c) == true
end

@Seelengrab
Copy link
Contributor Author

Seelengrab commented May 16, 2024

It seems we already have it: isnumeric

No, that returns true for any category that's numeric, this Pr is specifically about numeric decimal digits. E.g. is isnumeric, but is not a decimal digit. Its unicode category is Nl (Numeric, letter) not Nd (Numeric, decimal digit):

julia> 'ⅳ'
'ⅳ': Unicode U+2173 (category Nl: Number, letter)

julia> 'ⅳ' |> isnumeric
true

julia> 'ⅳ' |> Base.Unicode.category_code
10

julia> Base.Unicode.UTF8PROC_CATEGORY_ND
9

This PR is about UTF8PROC_CATEGORY_ND specifically, and would return false for 'ⅳ'.

@mbauman
Copy link
Sponsor Member

mbauman commented May 16, 2024

This would just be all kinds of breaking. Just skim across all the usages of isdigit throughout the ecosystem — so many packages do things that are effectively isdigit(c) && c - '0' or isdigit(c) && parse(Int, c). Documented or not, this is the behavior and folks rely on it.

mbauman pushed a commit that referenced this pull request May 16, 2024
The `isdigit` function only checks for ASCII digits — this PR clarifies
the docs to make that explicit.

See
#54447 (comment).
Closes #54447.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:unicode Related to unicode characters and encodings kind:breaking This change will break code kind:speculative Whether the change will be implemented is speculative
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants