Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why does yarl encode query params differently? #301

Open
bsolomon1124 opened this issue Apr 11, 2019 · 4 comments
Open

Why does yarl encode query params differently? #301

bsolomon1124 opened this issue Apr 11, 2019 · 4 comments

Comments

@bsolomon1124
Copy link

bsolomon1124 commented Apr 11, 2019

Note: I'm not filing this issue to say "yarl is wrong, urllib is right"---I know that the author of yarl knows RFC 3986 way, way better than I ever will and it could be yarl that is doing the better job of adhering to it. But I'm curious what motivates the difference in behavior below.

Summary: We have an API endpoint where one of the query string params is url=. aiohttp (via yarl.URL) and requests (via urllib.parse.urlencode) encode the query string differently.

>>> from yarl import URL
>>> url = URL("http://api.example.com")
>>> params = {
...     "url": "https://www.katherinetimes.com.au/story/6005621/wife-of-ex-nissan-boss-ghosn-leaves-japan/?src=rss",
...     "api_key": "XXXX",
... }
>>> url.with_query(params)
URL('http://api.example.com/?url=https://www.katherinetimes.com.au/story/6005621/wife-of-ex-nissan-boss-ghosn-leaves-japan/?src%3Drss&api_key=XXXX')

Notice that the ? in the URL (within the query string) is not encoded:

... japan/?src%3Drss&api ...

With the ? not encoded, the API endpoint gives us back a 400 response.

Now with urllib.parse:

>>> parts = urllib.parse.urlparse(url.human_repr())
>>> parts
ParseResult(scheme='http', netloc='api.example.com', path='/', params='', query='', fragment='')
>>> urllib.parse.urlunparse(
...     (
...         parts.scheme,
...         parts.netloc,
...         parts.path,
...         parts.params,
...         urllib.parse.urlencode(params),
...         parts.fragment
...     )
... )
'http://api.example.com/?url=https%3A%2F%2Fwww.katherinetimes.com.au%2Fstory%2F6005621%2Fwife-of-ex-nissan-boss-ghosn-leaves-japan%2F%3Fsrc%3Drss&api_key=XXXX'

Here the ? in the URL (within the query string) is encoded (as are the /).

... japan%2F%3Fsrc%3Drss&api ...

With the ? encoded, the API endpoint gives us back a 200 response.

So, urlencode has a default of safe='' and encodes pretty much everything. Is there a reason that yarl seems to diverge from this?

Now, when I look at RFC 3986, it actually seems like the / and ? should not need to be encoded:

   query         = *( pchar / "/" / "?" )

That is ironic, that it seems like yarl gets it "right" by leaving them unencoded, while urllib does too much work but then get us a 200 response.

So to make my question concrete:

  • Are my assumptions/understanding here correct?
  • Have you seen other situations like this?
  • Within aiohttp, is there optionality to specify something like safe, or a callable that is called to the actual encoding of params?

The problem here is that even if yarl is "right," it basically forces the encoding even if I pass the full raw URL string to something like aiohttp.request:

>>> URL('http://api.example.com?url=https%3A%2F%2Fwww.katherinetimes.com.au%2Fstory%2F6005621%2Fwife-of-ex-nissan-boss-ghosn-leaves-japan%2F%3Fsrc%3Drss&api_key=xxx')                                                                                
URL('http://api.example.com/?url=https://www.katherinetimes.com.au/story/6005621/wife-of-ex-nissan-boss-ghosn-leaves-japan/?src%3Drss&api_key=xxx')
>>> u = URL('http://api.example.com?url=https%3A%2F%2Fwww.katherinetimes.com.au%2Fstory%2F6005621%2Fwife-of-ex-nissan-boss-ghosn-leaves-japan%2F%3Fsrc%3Drss&api_key=xxx')
>>> u.raw_query_string
'url=https://www.katherinetimes.com.au/story/6005621/wife-of-ex-nissan-boss-ghosn-leaves-japan/?src%3Drss&api_key=xxx'

In aiohttp, the question then becomes "how can I override this behavior?" This is the answer I can come up with:

async def fullencode(params, *, _bp=BASE_PARTS):
    return URL(urllib.parse.urlunparse((
        _bp.scheme, _bp.netloc, _bp.path, _bp.params,
        urllib.parse.urlencode(params),
        _bp.fragment
    )), encoded=True)

Then pass await fullencode(params) to request(). (This assumes you have some base URL that forms _bp above.)

@bsolomon1124
Copy link
Author

bsolomon1124 commented Apr 11, 2019

Per a part in the aiohttp docs, at Passing Parameters in URLs:

You can also pass str content as param, but beware – content is not encoded by library. Note that + is not encoded.

But this seems to be contradicted by the section right below it!

Canonization encodes host part by IDNA codec and applies requoting to path and query parts.

And this is also contradicted empircally. If I pass the str

params = "url=https%3A%2F%2Fwww.katherinetimes.com.au%2Fstory%2F6005621%2Fwife-of-ex-nissan-boss-ghosn-leaves-japan%2F%3Fsrc%3Drss&api_key=XXXX"

Then resp.url is still url=http://www.parkeschampionpost.com.au/story/5991579/brunei-jets-ban-sought-over-gay-death-law/?src%3Drss&api_key=XXXX.

Additionally, it would be nice for with_query() to take the encoded param; currently, it hardcodes it.

@claudiopastorini
Copy link

I instead managed this problem passing the entire URL as a string to session.get().

In my particular case, I have a query parameter with some /:

'nextToken': 'AlmxMMAC3vpYMwhOWtKd+/rMNvA+w0GT/It9I451d0g0JuzSPCP4ETtG+WVETwCsO/fwHEFW3OBUvodMfKcBO2XZexjiCSgFtdbMkKARpeh7JLIiJOxfO63pYn6k1cyxJfM5bojKaKPHuDLlUOxJNarbAsZap5jdIdlbtPBklzR2zL4HU0sPg04dCGNnnNvR9jw+przYkkRx'

So I decided to use urllib:

params = urllib.parse.urlencode(params, safe='')
encoded_url = f"{url}?{params}"

And then:

async with session.get(encoded_url, headers=signed_headers) as response:

Thanks @bsolomon1124 for your analysis, I will watch this issue for an update

@dralley
Copy link

dralley commented Jul 8, 2022

Perhaps yarl could provide a "maximum compatibility" option which behaves the same as urllib? Yarl may be correct here but when wget and http and curl and everything else works with these URLs it's difficult to make that argument.

@webknjaz
Copy link
Member

and curl

Curl is a bad example here. It doesn't really sanitize/encode things. If you feed it with garbage, it forwards that garbage in its requests.

The others arguably target a different audience — end-users (their API is CLI) and are not libraries. For libraries, especially the low-level ones, it's more common to give more flexibility and responsibility to their users.

So at the moment I don't know how I feel about this "max compat" idea. yarl does provide a way to feed it "raw"/pre-encoded data with the encoded=True argument in the constructors: https://yarl.readthedocs.io/en/latest/api.html#yarl-api. So on yarl's side, the possibility to represent whatever encoding you want is present.
I guess maybe an extra constructor (implemented as @classmethod) with a different encoding strategy could be useful but what if in the future there will be more encoding quirk requests? I think this has a change of quickly becoming unsustainable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants