Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement idna on top of ICU4X #923

Open
wants to merge 40 commits into
base: main
Choose a base branch
from
Open

Reimplement idna on top of ICU4X #923

wants to merge 40 commits into from

Conversation

hsivonen
Copy link
Contributor

@hsivonen hsivonen commented Apr 9, 2024

Opening as draft PR to enable early feedback while the dependency remains unlanded in ICU4X.

The motivation of reformulating the idna crate on top of ICU4X is to be able to move Firefox's IDNA handling to use ICU4X (instead of the current combination of ICU4C and very old code). The ICU4X normalizer is faster than unicode-normalization and the ICU4X normalizer represents UTS 46 data as a normalization as opposed to representing it separately like the idna crate currently does.

The benchmarks in the idna crate itself show this PR to result in faster performance. This is also more correct than the old code: I removed skipping of the ContextJ tests from the harness that runs the UTS 46 test suite.

See the added README for removed capabilities. I searched for GitHub for public code using the idna crate, and I believe the removals to be mostly not need action from the ecosystem and to be tolerable when they do.

For projects that use ICU4X for normalization (or collation), this change has the benefit of deduplicating data across normalization and IDNA handling. There is the ecosystem risk of causing projects that use unicode-normalization for normalization in ways other than as a dependency of idna to end up with more data. One way to mitigate that (already preliminarily discussed with the maintainer) would be to introduce a cargo feature to unicode-normalization that would delegate the unicode-normalization internals to ICU4X (better performance, more crates in the dependency tree).

Not properly investigated yet: Binary size impact.

@valenting
Copy link
Collaborator

I think you need to explicitly add a dependency for the icu crate, instead of using a relative path

idna/Cargo.toml Outdated Show resolved Hide resolved
Copy link
Collaborator

@valenting valenting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add a direct dependency on the icu crates

@hsivonen
Copy link
Contributor Author

Yeah, the dependency declaration will change when this becomes a non-draft.

@hsivonen
Copy link
Contributor Author

hsivonen commented Apr 24, 2024

Using the demo https://github.com/hsivonen/urldemo (strip=true, lto=true, opt_level="z") and binaryen wasm-opt -Oz on the result, I get 215085 bytes with this patch and 310986 without, so this should not only improve performance but should also make (Wasm at least) binary size smaller.

@hsivonen hsivonen mentioned this pull request Apr 24, 2024
Copy link
Contributor

@djc djc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As someone who spent a bunch of time optimizing the idna crate a few years ago, cool to see more speedups here! Here's a bunch of stylistic suggestions, which could be applied more generally to a bunch of the code that was rewritten here.

idna/src/deprecated.rs Show resolved Hide resolved
idna/src/deprecated.rs Show resolved Hide resolved
idna/src/deprecated.rs Outdated Show resolved Hide resolved
idna/src/deprecated.rs Outdated Show resolved Hide resolved
idna/src/deprecated.rs Show resolved Hide resolved
idna/src/punycode.rs Show resolved Hide resolved
idna/src/punycode.rs Outdated Show resolved Hide resolved
idna/src/punycode.rs Show resolved Hide resolved
…or that does not check hyphens in positions 3 and 4
…s required)

Since other changes in this changeset require a semver break anyway, this
change takes a semver break in the case of `default-features = false` in
order to avoid a future semver break if in the future a need to add a
bring-your-own-data (using `icu_provider`) constructor for `Uts46`
shows up.
@hsivonen
Copy link
Contributor Author

hsivonen commented May 3, 2024

Since these changes require a semver increment anyway, I took the opportunity to add a currently-required compiled_data feature in order to future-proof against having to take a semver break if a use case for dynamic data loading using the ICU4X provider shows up. (CC @sffc )

From my perspective, this PR is now done expect for changing the ICU4X dependencies to point to crates.io once unicode-org/icu4x#4712 has landed and been published to crates.io. Leaving this PR in the draft state until then, but review is welcome before changing to non-draft.

@hsivonen hsivonen marked this pull request as ready for review May 23, 2024 07:11
Copy link

codecov bot commented May 23, 2024

Codecov Report

Attention: Patch coverage is 78.66795% with 221 lines in your changes are missing coverage. Please review.

Please upload report for BASE (main@de947ab). Learn more about missing BASE report.

Files Patch % Lines
idna/src/uts46.rs 76.89% 165 Missing ⚠️
idna/src/punycode.rs 73.13% 18 Missing ⚠️
idna/tests/deprecated.rs 82.35% 18 Missing ⚠️
idna/src/lib.rs 44.44% 10 Missing ⚠️
idna/tests/uts46.rs 85.41% 7 Missing ⚠️
idna/src/deprecated.rs 96.29% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #923   +/-   ##
=======================================
  Coverage        ?   79.94%           
=======================================
  Files           ?       22           
  Lines           ?     4208           
  Branches        ?        0           
=======================================
  Hits            ?     3364           
  Misses          ?      844           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hsivonen
Copy link
Contributor Author

I looked at the red lines for non-test code in the coverage report, and it's not as awful as the percentages suggest. In particular, it looks like const-evaluated code shows up as uncovered. Could be better, of course.

@hsivonen
Copy link
Contributor Author

I did the minimum possible version bump for url before realizing that url now rejects URLs whose domain violates the ContextJ rule. (Additionally, the error classification for URLs that violate the forbidden domain code point rule changed, but that change doesn't change whether or not these is an error.)

@hsivonen hsivonen requested a review from valenting May 23, 2024 07:57
@hsivonen
Copy link
Contributor Author

From the point of view of url users, the changes are:

  • Change in dependency tree. I believe not a semver break per community standards.
  • Change in MSRV. Not a semver break per community standards.
  • Change in which error kind is reported for forbidden domain code point. Rather silly to treat as a semver break.
  • URLs whose domain violates the ContextJ rule are now rejected (as they should have been all along). Semver break or not?

@hsivonen
Copy link
Contributor Author

The dependency version increment is about dealing with an issue in the transitive dependency declarations which caused a problem in the case where a dependent already had the _data crates downloaded at 1.4.0.

Note that 1.5 versions have also already been released.

GitHub shows this changeset as "Changes requested". I believe I've addressed the request already, but I don't see what GitHub UI action I need to take to dismiss the "Changes requested" state.

@djc
Copy link
Contributor

djc commented May 29, 2024

From the point of view of url users, the changes are:

* Change in dependency tree. I believe not a semver break per community standards.

* Change in MSRV. Not a semver break per community standards.

* Change in which error kind is reported for forbidden domain code point. Rather silly to treat as a semver break.

* URLs whose domain violates the ContextJ rule are now rejected (as they should have been all along). Semver break or not?

Doesn't sound like a semver break is needed.

GitHub shows this changeset as "Changes requested". I believe I've addressed the request already, but I don't see what GitHub UI action I need to take to dismiss the "Changes requested" state.

I think only the maintainers themselves can dismiss this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants