Reimplement idna on top of ICU4X #923

hsivonen · 2024-04-09T16:17:25Z

Opening as draft PR to enable early feedback while the dependency remains unlanded in ICU4X.

The motivation for reformulating the idna crate on top of ICU4X is to be able to move Firefox's IDNA handling to ICU4X (instead of the current combination of ICU4C and very old code). The ICU4X normalizer is faster than unicode-normalization, and it represents the UTS 46 data as a normalization rather than representing it separately as the idna crate currently does.

The benchmarks in the idna crate itself show this PR to result in faster performance. This is also more correct than the old code: I removed skipping of the ContextJ tests from the harness that runs the UTS 46 test suite.

See the added README for removed capabilities. I searched GitHub for public code using the idna crate, and I believe the removals mostly do not need action from the ecosystem and are tolerable when they do.

For projects that use ICU4X for normalization (or collation), this change has the benefit of deduplicating data across normalization and IDNA handling. There is the ecosystem risk of causing projects that use unicode-normalization for normalization in ways other than as a dependency of idna to end up with more data. One way to mitigate that (already preliminarily discussed with the maintainer) would be to introduce a cargo feature to unicode-normalization that would delegate the unicode-normalization internals to ICU4X (better performance, more crates in the dependency tree).

Not properly investigated yet: Binary size impact.

valenting · 2024-04-11T08:06:23Z

I think you need to explicitly add a dependency for the icu crate, instead of using a relative path

idna/Cargo.toml

valenting

We need to add a direct dependency on the icu crates

hsivonen · 2024-04-18T10:41:23Z

Yeah, the dependency declaration will change when this becomes a non-draft.

hsivonen · 2024-04-24T12:45:29Z

Using the demo https://github.com/hsivonen/urldemo (strip=true, lto=true, opt_level="z") and binaryen wasm-opt -Oz on the result, I get 215085 bytes with this patch and 310986 without, so this should not only improve performance but should also make (Wasm at least) binary size smaller.
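To put the two figures above side by side, here is a minimal sketch of the arithmetic. The helper name `size_reduction` is hypothetical and only serves this illustration; the byte counts are the ones quoted in the comment.

```rust
/// Size reduction between two builds, as (bytes saved, percent of the `before` size).
/// Hypothetical helper for illustration only.
fn size_reduction(before: u64, after: u64) -> (u64, f64) {
    let saved = before.saturating_sub(after);
    let percent = if before == 0 {
        0.0
    } else {
        saved as f64 * 100.0 / before as f64
    };
    (saved, percent)
}

fn main() {
    // Figures quoted above: 310986 bytes without the patch, 215085 bytes with it.
    let (saved, percent) = size_reduction(310_986, 215_085);
    println!("saved {saved} bytes ({percent:.1}%)"); // prints "saved 95901 bytes (30.8%)"
}
```

In other words, the Wasm demo shrinks by roughly 31% under these particular build settings; other targets and optimization levels may differ.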

djc

As someone who spent a bunch of time optimizing the idna crate a few years ago, cool to see more speedups here! Here's a bunch of stylistic suggestions, which could be applied more generally to a bunch of the code that was rewritten here.

idna/src/deprecated.rs

idna/src/punycode.rs

…s required) Since other changes in this changeset require a semver break anyway, this change takes a semver break in the case of `default-features = false` in order to avoid a future semver break if in the future a need to add a bring-your-own-data (using `icu_provider`) constructor for `Uts46` shows up.

hsivonen · 2024-05-03T09:13:59Z

Since these changes require a semver increment anyway, I took the opportunity to add a currently-required compiled_data feature in order to future-proof against having to take a semver break if a use case for dynamic data loading using the ICU4X provider shows up. (CC @sffc )

From my perspective, this PR is now done except for changing the ICU4X dependencies to point to crates.io once unicode-org/icu4x#4712 has landed and been published to crates.io. Leaving this PR in the draft state until then, but review is welcome before changing to non-draft.

codecov · 2024-05-23T07:35:01Z

Codecov Report

Attention: Patch coverage is 78.66795%, with 221 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@de947ab). Learn more about missing BASE report.

| Files | Patch % | Lines |
|---|---|---|
| idna/src/uts46.rs | 76.89% | 165 Missing ⚠️ |
| idna/src/punycode.rs | 73.13% | 18 Missing ⚠️ |
| idna/tests/deprecated.rs | 82.35% | 18 Missing ⚠️ |
| idna/src/lib.rs | 44.44% | 10 Missing ⚠️ |
| idna/tests/uts46.rs | 85.41% | 7 Missing ⚠️ |
| idna/src/deprecated.rs | 96.29% | 3 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #923   +/-   ##
=======================================
  Coverage        ?   79.94%           
=======================================
  Files           ?       22           
  Lines           ?     4208           
  Branches        ?        0           
=======================================
  Hits            ?     3364           
  Misses          ?      844           
  Partials        ?        0


hsivonen · 2024-05-23T07:51:31Z

I looked at the red lines for non-test code in the coverage report, and it's not as awful as the percentages suggest. In particular, it looks like const-evaluated code shows up as uncovered. Could be better, of course.

hsivonen · 2024-05-23T07:55:19Z

I did the minimum possible version bump for url before realizing that url now rejects URLs whose domain violates the ContextJ rule. (Additionally, the error classification for URLs that violate the forbidden domain code point rule changed, but that change doesn't change whether or not there is an error.)
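For readers unfamiliar with ContextJ, the following is a simplified sketch of one half of the rule, not the code in this PR. Per RFC 5892 Appendix A.2, U+200D ZERO WIDTH JOINER is only permitted when the preceding code point has Canonical_Combining_Class Virama. The function name `zwj_context_ok` and the three-entry `SAMPLE_VIRAMAS` table are assumptions made for illustration; a real implementation consults full Unicode property data and also handles U+200C ZERO WIDTH NON-JOINER, whose rule has an additional joining-type alternative that is omitted here.

```rust
const ZWJ: char = '\u{200D}';

/// Tiny illustrative subset of code points with Canonical_Combining_Class = Virama (9).
/// Assumption for this sketch: a real implementation would use full Unicode data.
const SAMPLE_VIRAMAS: [char; 3] = [
    '\u{094D}', // DEVANAGARI SIGN VIRAMA
    '\u{09CD}', // BENGALI SIGN VIRAMA
    '\u{0D4D}', // MALAYALAM SIGN VIRAMA
];

/// Returns true if every ZWJ in `label` directly follows a virama from the sample set.
fn zwj_context_ok(label: &str) -> bool {
    let mut prev: Option<char> = None;
    for c in label.chars() {
        if c == ZWJ && !prev.map_or(false, |p| SAMPLE_VIRAMAS.contains(&p)) {
            return false;
        }
        prev = Some(c);
    }
    true
}

fn main() {
    // ZWJ after a Devanagari virama: allowed by the rule.
    assert!(zwj_context_ok("\u{0915}\u{094D}\u{200D}\u{0937}"));
    // ZWJ between ASCII letters: a ContextJ violation.
    assert!(!zwj_context_ok("a\u{200D}b"));
}
```

A domain label like the second one is the kind of input that is now rejected rather than accepted.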

hsivonen · 2024-05-23T08:03:07Z

From the point of view of url users, the changes are:

* Change in dependency tree. I believe not a semver break per community standards.
* Change in MSRV. Not a semver break per community standards.
* Change in which error kind is reported for forbidden domain code point. Rather silly to treat as a semver break.
* URLs whose domain violates the ContextJ rule are now rejected (as they should have been all along). Semver break or not?

hsivonen · 2024-05-29T04:59:37Z

The dependency version increment is about dealing with an issue in the transitive dependency declarations which caused a problem in the case where a dependent already had the _data crates downloaded at 1.4.0.

Note that 1.5 versions have also already been released.

GitHub shows this changeset as "Changes requested". I believe I've addressed the request already, but I don't see what GitHub UI action I need to take to dismiss the "Changes requested" state.

djc · 2024-05-29T08:24:00Z

> From the point of view of url users, the changes are:
>
> * Change in dependency tree. I believe not a semver break per community standards.
> * Change in MSRV. Not a semver break per community standards.
> * Change in which error kind is reported for forbidden domain code point. Rather silly to treat as a semver break.
> * URLs whose domain violates the ContextJ rule are now rejected (as they should have been all along). Semver break or not?

Doesn't sound like a semver break is needed.

> GitHub shows this changeset as "Changes requested". I believe I've addressed the request already, but I don't see what GitHub UI action I need to take to dismiss the "Changes requested" state.

I think only the maintainers themselves can dismiss this.

hsivonen added 14 commits March 20, 2024 17:43

Reimplement idna on top of ICU4X

a8977a8

Add an even faster lower-case ASCII letter path to avoid regressing performance

09765af

Comments and verify_dns_length tweak

7e929ce

Parametrize internal vs. external Punycode caller; restore external API behavior

f413387

Add bench for to_ascii on an already-Punycode name

71c03b9

Avoid re-encoding Punycode when possible

9af00cb

Pass through the input slice in many more cases

dc8f301

Add testing for the simultaneous mode

41e0192

Omit the invalid domain character check on the url side

41f2107

Document that Punycode labels must result in non-ASCII

4d7d41a

Rename files called uts46.rs to deprecated.rs

98ca752

Rename uts46bis to uts46

4bbabe9

Tweak docs

7dc0082

Avoid useless copying and useless UTF-8 decode

f8eb96e

valenting reviewed Apr 12, 2024

idna/Cargo.toml (outdated)

valenting requested changes Apr 12, 2024

hsivonen added 5 commits April 15, 2024 14:29

Use inline(never) to optimize binary size

eb6e3d5

Split CheckHyphens into a separate concern from the ASCII deny list

ce3d4d1

Make the ASCII deny list customizable

6672161

Better docs and top-level functions

90fe4b3

Parameter for VerifyDNSLength

50381ff

hsivonen added 6 commits April 18, 2024 14:11

Restore support for transitional processing to minimize breakage

8268c5a

In the deprecated API, use empty deny list with use_std3_ascii_rules=false

999bef4

Tweak docs

b277c85

Docs, rename AsciiDenyList::WHATWG to ::URL, tweak top-level functions

980348c

Use idna crate top-level function in the url crate to dogfood the top-level function

4efd589

Add a Usage section to the README

da6cf50

hsivonen mentioned this pull request Apr 24, 2024

WASM file size #557

Open

djc reviewed Apr 24, 2024

hsivonen added 7 commits April 26, 2024 14:32

Add an early return to map_transitional for readability

d938024

Document internal vs. external Punycode caller differences

679edb9

Per discussion with Valentin, revert deprecated API to the old behavior that does not check hyphens in positions 3 and 4

4f605c9

Add comments about not fixing deprecated API

bbf4308

Merge branch 'main' into icu4x

e842dae

Add a comment explaining FailFast in deprecated.rs

6690c49

hsivonen added 5 commits May 20, 2024 18:53

Remove remark about spec violation by making root dot permissibility configurable

52137e7

Clarify README about IDNA 2003/2008

081f44b

Add a historical remark to the README

aaa7a40

Fix typo

8b03034

Depend on crates.io versions of icu_normalizer and icu_properties

c8a4bd3

hsivonen marked this pull request as ready for review May 23, 2024 07:11

Address clippy lints

be3db8e

Update versions

6020673

hsivonen requested a review from valenting May 23, 2024 07:57

Increment dependency versions

245c514
