Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add more type checking in params/form-data #3170

Open
1 task
trim21 opened this issue Apr 17, 2024 · 11 comments
Open
1 task

add more type checking in params/form-data #3170

trim21 opened this issue Apr 17, 2024 · 11 comments

Comments

@trim21
Copy link
Contributor

trim21 commented Apr 17, 2024

Problem

we should check input value types. Current behavoir which simply convert input value to str(v) is not very ideal.

for example:

import httpx


r = httpx.post(
    "https://httpbin.org/post",
    data={
        "content": bytes.fromhex(
            "4f5f13d16da9f10d013933673a1ef4e7b974dbe3"
        ),  # random non-printable bytes
    },
)
print(r.json()["form"])

will get {'content': "b'O_\\x13\\xd1m\\xa9\\xf1\\r\\x0193g:\\x1e\\xf4\\xe7\\xb9t\\xdb\\xe3'"}

Proposal

add type checking and more coerce.

for example:

form

  • convert int/float to str with str(v)
  • bool (Represent as JSON-style "true"/"false")
  • None (Represent as empty string)
  • Enum (Use .value and then as above.)
  • TypeError for anything else.

query

except about, query should support bytes with percent encoding and do not throw.

multipart_data

not sure how to handle this....

This will cause breaking changes, so we should consider do a minor version bump.


related: #1500 #1539 #1608

httpx/httpx/_content.py

Lines 136 to 149 in 4b85e6c

def encode_urlencoded_data(
data: RequestData,
) -> tuple[dict[str, str], ByteStream]:
plain_data = []
for key, value in data.items():
if isinstance(value, (list, tuple)):
plain_data.extend([(key, primitive_value_to_str(item)) for item in value])
else:
plain_data.append((key, primitive_value_to_str(value)))
body = urlencode(plain_data, doseq=True).encode("utf-8")
content_length = str(len(body))
content_type = "application/x-www-form-urlencoded"
headers = {"Content-Length": content_length, "Content-Type": content_type}
return headers, ByteStream(body)

httpx/httpx/_utils.py

Lines 56 to 68 in 4b85e6c

def primitive_value_to_str(value: PrimitiveData) -> str:
"""
Coerce a primitive data type into a string value.
Note that we prefer JSON-style 'true'/'false' for boolean values here.
"""
if value is True:
return "true"
elif value is False:
return "false"
elif value is None:
return ""
return str(value)

@trim21 trim21 changed the title add more type checking in headers/params/form-data add more type checking in params/form-data Apr 17, 2024
@trim21
Copy link
Contributor Author

trim21 commented May 31, 2024

there is also another issue, query string should support bytes natively.

for example:

import httpx
import urllib.parse

r = httpx.get(
    "https://httpbin.org/get",
    params={"q": bytes.fromhex("E1EE0E2734986F5419BB6CB6252BD9377183440E")},
)

print(urllib.parse.quote(bytes.fromhex("E1EE0E2734986F5419BB6CB6252BD9377183440E")))

print(r.text)

expected url should be https://httpbin.org/get?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E, instead of ValueError or TypeError, or currently https://httpbin.org/get?q=b\"\\xe1\\xee\\x0e'4\\x98oT\\x19\\xbbl\\xb6%25%2B\\xd97q\\x83D\\x0e\"

@tomchristie
Copy link
Member

query string should support bytes natively.

I'm not convinced by that. Why?

@trim21
Copy link
Contributor Author

trim21 commented Jun 3, 2024

query string should support bytes natively.

I'm not convinced by that. Why?

because percent encoding can take native bytes as input

@trim21
Copy link
Contributor Author

trim21 commented Jun 3, 2024

when we say it support "string", it's actually assuming percent encoded bytes is utf8 content.

consider this:

from urllib.parse import quote, quote_from_bytes


utf8_s = "你好"

assert quote(utf8_s) == quote_from_bytes(utf8_s.encode())

https://example.com/?q=你好 is actually https://example.com/?q=%E4%BD%A0%E5%A5%BD

that's also what urllib.parse.quote is doing by default, it encode str (utf8 by default) to bytes then use quote_from_bytes

@tomchristie
Copy link
Member

I think this is confusing a couple of separate issues...

  • Should str-coercion be performed.
  • Should bytes be a supported type.

I'm think there's a good case that we shouldn't str-coerce types outside of the expected range. I don't currently think there's a good case that we should support bytes.

@trim21
Copy link
Contributor Author

trim21 commented Jun 3, 2024

I would vote for both...

consider this:

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

looks fine, right?

but actually request url is ?q=%EF%BF%BD%EF%BF%BD%0E%274%EF%BF%BDoT%19%EF%BF%BDl%EF%BF%BD%25%2B%EF%BF%BD7q%EF%BF%BDD%0E&a=1, which it not right.

correct value of q is bytes.fromhex('e1ee0e2734986f5419bb6cb6252bd9377183440e'), but httpx send url with q bytes.fromhex('efbfbdefbfbd0e2734efbfbd6f5419efbfbd6cefbfbd252befbfbd3771efbfbd440e')


also send bytes as query value is supported by requests, which is handled correctly

@cknv
Copy link

cknv commented Jun 3, 2024

Just to chip in a bit here:

I have been using httpx to interact with a legacy system and as test client to mimic that legacy system calling back into our starlette based API.
Some of the query fields can occasionally be bytes, these bytes are either just bytes that only has meaning to systems outside of my control, or UTF-16 encoded strings (which as an aside can also mean that I have to produce a mixed content query string).

In order to get the correct values through httpx and not 'b"value"' I have had to resort to manually building the query myself with the stdlib and appending it on the URL.
This does work, but it took some time to arrive at and I wouldn't exactly call it "ergonomic".

But I also realize that my case is not common.

@tomchristie
Copy link
Member

consider this (...)

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

This is valid, tho simplify it...

>>> r = httpx.get('https://www.example.com?q=%EE', params={'a': '1'})
>>> r.request.url
URL('https://www.example.com?q=%EF%BF%BD&a=1')

Simplify it more...

>>> httpx.QueryParams('q=%EE')
QueryParams('q=%EF%BF%BD')

Eh?

Root cause is...

>>> from urllib.parse import unquote
>>> from urllib.parse import quote
>>> quote(unquote('%EE'))  # stdlib behaving similarly
'%EF%BF%BD'
>>> unquote('%EE')  # The hex code here isn't a valid UTF-8 codepoint, and is being replaced.
'�'
>>> unquote('%EE', errors='strict')  # We'd raise an error if we decoded it with 'strict'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tom.christie/.pyenv/versions/3.10.6/lib/python3.10/urllib/parse.py", line 667, in unquote
    append(unquote_to_bytes(bits[i]).decode(encoding, errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

@trim21
Copy link
Contributor Author

trim21 commented Jun 3, 2024

consider this (...)

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

This is valid, tho simplify it...

>>> r = httpx.get('https://www.example.com?q=%EE', params={'a': '1'})
>>> r.request.url
URL('https://www.example.com?q=%EF%BF%BD&a=1')

Simplify it more...

>>> httpx.QueryParams('q=%EE')
QueryParams('q=%EF%BF%BD')

Eh?

Root cause is...

>>> from urllib.parse import unquote
>>> from urllib.parse import quote
>>> quote(unquote('%EE'))  # stdlib behaving similarly
'%EF%BF%BD'
>>> unquote('%EE')  # The hex code here isn't a valid UTF-8 codepoint, and is being replaced.
'�'
>>> unquote('%EE', errors='strict')  # We'd raise an error if we decoded it with 'strict'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tom.christie/.pyenv/versions/3.10.6/lib/python3.10/urllib/parse.py", line 667, in unquote
    append(unquote_to_bytes(bits[i]).decode(encoding, errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

yes, that's a problem, I think.

if quote_from_bytes(unquote_to_bytes("%EE")) then everything is fine, in this case.

@tomchristie
Copy link
Member

@cknv Could you give a (simplified) example please?

@cknv
Copy link

cknv commented Jun 3, 2024

Sure @tomchristie! Get ready for some flashback to the dark ages, before UTF8 won the encoding wars.

from urllib.parse import urlencode
import httpx

client = httpx.Client()

values = {
    "expected-as-utf8": "Hello world",
    "expected-as-int": 1234,
    "expected-as-latin1": "Hello world".encode("latin1"),
    "expected-as-binary": b"\x01\x02\x03\x04",
}

encoded_query = urlencode(values)

# replaced .get with .build_request
# as making requests does not seem important for demonstrating my workaround
correctly_encoded_request = client.build_request(
    method="GET",
    url=f"http://localhost?{encoded_query}",
)
incorrecty_encoded_request = client.build_request(
    method="GET",
    url="http://localhost",
    params=values,
)

# compare and contrast:
print(correctly_encoded_request.url.params)
print(incorrecty_encoded_request.url.params)

When actually rereading the code, it wasn't UTF-16 that was used, but instead actually latin1, so I replaced that in the example, although either of them would need to be encoded into bytes.

Narrowing our focus to just the plain byte value, it should be expected-as-binary=%01%02%03%04, and my workaround indeed does produce that.
Without the workaround and just feeding the values dict into the request builders params the value is reprd (via str) and produces expected-as-binary=b%27%5Cx01%5Cx02%5Cx03%5Cx04%27 which is the percent encoded version of expected-as-binary=b'\x01\x02\x03\x04'.

The latin1 value works without the workaround as long as you stick to the subset that is identical with ASCII, but we do at times get data through that isn't just that subset.

To add some context the server receiving the request is written in C, and has a very lax attitude to encodings because it's all just bytes anyway.

If httpx would percent encode bytes, I would probably be able to remove that workaround. However I would not be surprised if there are other ways to send bytes over http either established in an RFC or just by tradition, if so I would not know which one is the most appropriate, but it appears like percent encoding is at least in the RFC.

Finally, I would like to reiterate that I know that this is way outside the mainstream, I have mostly accepted my workaround, but I at least wanted to raise my hand when someone else expressed a similar problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants