Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate ASCII host values #954

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Validate ASCII host values #954

wants to merge 1 commit into from

Conversation

mjpieters
Copy link
Contributor

Host values must match the unreserved, sub-delims or pct-encoded grammar
rules. The IDNA encoder already checks for this, but for ASCII values we
skip IDNA encoding.

If the invalid character is @, or : followed by @ further in the
host value, tack on advice about using authority instead of host.

Fixes #880

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • Add a new news fragment into the CHANGES folder
    • name it <issue_id>.<type> (e.g. 588.bugfix)
    • if you don't have an issue_id change it to the pr id after creating the PR
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: Fix issue with non-ascii contents in doctest text files.

CHANGES/8080.bugfix.rst Outdated Show resolved Hide resolved
CHANGES/8080.bugfix.rst Outdated Show resolved Hide resolved
@webknjaz
Copy link
Member

@mjpieters do you want me to reject the release build, wait and include this PR?

@webknjaz
Copy link
Member

Not sure why the changelog bot is failing to notice the towncrier fragment... Not sure why but it's safe to ignore for now.

This comment was marked as outdated.

@webknjaz
Copy link
Member

[towncrier-fragments]:13: : Spell check: hostnames: Started rejecting ASCII hostnames with invalid characters. For host strings that.

@mjpieters this can be fixed by adding "hostnames" to docs/spelling_wordlist.txt. Though, I hope to integrate codespell later that would replace sphinxcontrib-spelling and seems to have less false positives.

@mjpieters
Copy link
Contributor Author

While implementing my PR I noticed that if you used a non-ASCII hostname with @ or any other invalid character (so not unreserved, percent encoded or sub-delims), then idna.encode() would raise an exception. But when I tried to reproduce this behaviour, just now, I noticed we swallow that exception, because it is a UnicodeError subclass 🤦 .

I'm not sure why the encoder uses except UnicodeError with a fallback to host.encode('idna').decode('ascii'), because that completely defeats the valid codepoint check.

I can hoist the validation to apply to non-ASCII hostnames too but that means I need to explicitly test for all disallowed ASCII codepoint (instead of the negative character classes I use now).

I'll go search through the commit history to see if I cannot find out more here.

@mjpieters
Copy link
Contributor Author

@mjpieters this can be fixed by adding "hostnames" to docs/spelling_wordlist.txt. Though, I hope to integrate codespell later that would replace sphinxcontrib-spelling and seems to have less false positives.

:-D I already have that change staged locally :-P

@webknjaz
Copy link
Member

@mjpieters so do you think it's worth putting into a separate release? The automation is there so it wouldn't take as long to make new releases from now on...

@mjpieters mjpieters force-pushed the validate_host branch 2 times, most recently from 3f8f6a8 to 874b429 Compare November 20, 2023 15:25
@mjpieters
Copy link
Contributor Author

Not sure why the changelog bot is failing to notice the towncrier fragment... Not sure why but it's safe to ignore for now.

It is looking for the files in the wrong location, news/ instead of CHANGES/.

@mjpieters
Copy link
Contributor Author

@mjpieters do you want me to reject the release build, wait and include this PR?

I think it's fine to put this in a next release. I don't see this as urgent.

@webknjaz
Copy link
Member

@mjpieters do you want me to reject the release build, wait and include this PR?

I think it's fine to put this in a next release. I don't see this as urgent.

Okay, will proceed with releasing, then. The build is half-way ready: https://github.com/aio-libs/yarl/actions/runs/6931722870.

@mjpieters
Copy link
Contributor Author

I'll raise a separate issue for IDNA host encoding. I can see why the UnicodeError exception handler is there but I think it needs some reworking.

This PR can focus on ASCII hostnames only, as a first step.

CHANGES/880.bugfix.rst Outdated Show resolved Hide resolved
@mjpieters
Copy link
Contributor Author

Okay, will proceed with releasing, then. The build is half-way ready: https://github.com/aio-libs/yarl/actions/runs/6931722870.

What a glorious deployment pipeline we have! 😁

@webknjaz
Copy link
Member

Not sure why the changelog bot is failing to notice the towncrier fragment... Not sure why but it's safe to ignore for now.

It is looking for the files in the wrong location, news/ instead of CHANGES/.

Oh, that makes sense — this is because I moved the config to a dedicated file and haven't updates the bot to account for that. Thanks for pointing it out!

Host values must match the unreserved, sub-delims or pct-encoded grammar
rules. The IDNA encoder already checks for this, but for ASCII values we
skip IDNA encoding.

If the invalid character is `@`, or `:` followed by `@` further in the
host value, tack on advice about using `authority` instead of `host`.
@webknjaz
Copy link
Member

Okay, will proceed with releasing, then. The build is half-way ready: https://github.com/aio-libs/yarl/actions/runs/6931722870.

What a glorious deployment pipeline we have! 😁

Hopefully, I'll make it scalable and reuse for other projects. I have similar things in many other repos but they all have tiny differences which makes them hard to reuse as is.
I made the CI run tests from the exact sdists we publish and install the wheels instead of rebuilding from Git source in each job. Though, I broke codecov reporting along the way so ignore it for now, I'll fix it later, hopefully.

@webknjaz
Copy link
Member

Okay, will proceed with releasing, then. The build is half-way ready: https://github.com/aio-libs/yarl/actions/runs/6931722870.

It took a few reattempts to get the last bit right... But now we have v1.9.3:

@webknjaz webknjaz added the bug label Nov 21, 2023
@webknjaz
Copy link
Member

Though, I broke codecov reporting along the way so ignore it for now, I'll fix it later, hopefully.

Fixed it. The coverage metrics are back up!

@webknjaz
Copy link
Member

I'm not sure why the encoder uses except UnicodeError with a fallback to host.encode('idna').decode('ascii'), because that completely defeats the valid codepoint check.

I checked the coverage reported before I broke it temporarily, and turns out that the line you're talking about is the only one uncovered in the entire project: https://app.codecov.io/gh/aio-libs/yarl/commit/65333fab5f5b2bcce9fbb6aa810e9618fa10c6f0/blob/yarl/_url.py#L1176.

So I guess this might also need some extra extensive testing...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

basic-auth user and password are converted to lowercase when passed as part of "host" to URL.build
2 participants