add more type checking in params/form-data #3170

trim21 · 2024-04-17T11:54:22Z

Initially raised as discussion bytes in form is handled in a very strange way #3168

Problem

we should check input value types. Current behavoir which simply convert input value to str(v) is not very ideal.

for example:

import httpx


r = httpx.post(
    "https://httpbin.org/post",
    data={
        "content": bytes.fromhex(
            "4f5f13d16da9f10d013933673a1ef4e7b974dbe3"
        ),  # random non-printable bytes
    },
)
print(r.json()["form"])

will get {'content': "b'O_\\x13\\xd1m\\xa9\\xf1\\r\\x0193g:\\x1e\\xf4\\xe7\\xb9t\\xdb\\xe3'"}

Proposal

add type checking and more coerce.

for example:

form

convert int/float to str with str(v)
bool (Represent as JSON-style "true"/"false")
None (Represent as empty string)
Enum (Use .value and then as above.)
TypeError for anything else.

query

except about, query should support bytes with percent encoding and do not throw.

multipart_data

not sure how to handle this....

This will cause breaking changes, so we should consider do a minor version bump.

related: #1500 #1539 #1608

httpx/httpx/_content.py

Lines 136 to 149 in 4b85e6c

    
           def encode_urlencoded_data( 
        
               data: RequestData, 
        
           ) -> tuple[dict[str, str], ByteStream]: 
        
               plain_data = [] 
        
               for key, value in data.items(): 
        
                   if isinstance(value, (list, tuple)): 
        
                       plain_data.extend([(key, primitive_value_to_str(item)) for item in value]) 
        
                   else: 
        
                       plain_data.append((key, primitive_value_to_str(value))) 
        
               body = urlencode(plain_data, doseq=True).encode("utf-8") 
        
               content_length = str(len(body)) 
        
               content_type = "application/x-www-form-urlencoded" 
        
               headers = {"Content-Length": content_length, "Content-Type": content_type} 
        
               return headers, ByteStream(body)

httpx/httpx/_utils.py

Lines 56 to 68 in 4b85e6c

    
           def primitive_value_to_str(value: PrimitiveData) -> str: 
        
               """ 
        
               Coerce a primitive data type into a string value. 
        
               Note that we prefer JSON-style 'true'/'false' for boolean values here. 
        
               """ 
        
               if value is True: 
        
                   return "true" 
        
               elif value is False: 
        
                   return "false" 
        
               elif value is None: 
        
                   return "" 
        
               return str(value)

The text was updated successfully, but these errors were encountered:

trim21 · 2024-05-31T13:38:16Z

there is also another issue, query string should support bytes natively.

for example:

import httpx
import urllib.parse

r = httpx.get(
    "https://httpbin.org/get",
    params={"q": bytes.fromhex("E1EE0E2734986F5419BB6CB6252BD9377183440E")},
)

print(urllib.parse.quote(bytes.fromhex("E1EE0E2734986F5419BB6CB6252BD9377183440E")))

print(r.text)

expected url should be https://httpbin.org/get?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E, instead of ValueError or TypeError, or currently https://httpbin.org/get?q=b\"\\xe1\\xee\\x0e'4\\x98oT\\x19\\xbbl\\xb6%25%2B\\xd97q\\x83D\\x0e\"

tomchristie · 2024-06-03T10:43:17Z

query string should support bytes natively.

I'm not convinced by that. Why?

trim21 · 2024-06-03T10:47:31Z

query string should support bytes natively.

I'm not convinced by that. Why?

because percent encoding can take native bytes as input

trim21 · 2024-06-03T10:53:08Z

when we say it support "string", it's actually assuming percent encoded bytes is utf8 content.

consider this:

from urllib.parse import quote, quote_from_bytes


utf8_s = "你好"

assert quote(utf8_s) == quote_from_bytes(utf8_s.encode())

https://example.com/?q=你好 is actually https://example.com/?q=%E4%BD%A0%E5%A5%BD

that's also what urllib.parse.quote is doing by default, it encode str (utf8 by default) to bytes then use quote_from_bytes

tomchristie · 2024-06-03T11:18:33Z

I think this is confusing a couple of separate issues...

Should str-coercion be performed.
Should bytes be a supported type.

I'm think there's a good case that we shouldn't str-coerce types outside of the expected range. I don't currently think there's a good case that we should support bytes.

trim21 · 2024-06-03T11:24:02Z

I would vote for both...

consider this:

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

looks fine, right?

but actually request url is ?q=%EF%BF%BD%EF%BF%BD%0E%274%EF%BF%BDoT%19%EF%BF%BDl%EF%BF%BD%25%2B%EF%BF%BD7q%EF%BF%BDD%0E&a=1, which it not right.

correct value of q is bytes.fromhex('e1ee0e2734986f5419bb6cb6252bd9377183440e'), but httpx send url with q bytes.fromhex('efbfbdefbfbd0e2734efbfbd6f5419efbfbd6cefbfbd252befbfbd3771efbfbd440e')

also send bytes as query value is supported by requests， which is handled correctly

cknv · 2024-06-03T11:34:29Z

Just to chip in a bit here:

I have been using httpx to interact with a legacy system and as test client to mimic that legacy system calling back into our starlette based API.
Some of the query fields can occasionally be bytes, these bytes are either just bytes that only has meaning to systems outside of my control, or UTF-16 encoded strings (which as an aside can also mean that I have to produce a mixed content query string).

In order to get the correct values through httpx and not 'b"value"' I have had to resort to manually building the query myself with the stdlib and appending it on the URL.
This does work, but it took some time to arrive at and I wouldn't exactly call it "ergonomic".

But I also realize that my case is not common.

tomchristie · 2024-06-03T16:57:58Z

consider this (...)

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

This is valid, tho simplify it...

>>> r = httpx.get('https://www.example.com?q=%EE', params={'a': '1'})
>>> r.request.url
URL('https://www.example.com?q=%EF%BF%BD&a=1')

Simplify it more...

>>> httpx.QueryParams('q=%EE')
QueryParams('q=%EF%BF%BD')

Eh?

Root cause is...

>>> from urllib.parse import unquote
>>> from urllib.parse import quote
>>> quote(unquote('%EE'))  # stdlib behaving similarly
'%EF%BF%BD'
>>> unquote('%EE')  # The hex code here isn't a valid UTF-8 codepoint, and is being replaced.
'�'
>>> unquote('%EE', errors='strict')  # We'd raise an error if we decoded it with 'strict'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tom.christie/.pyenv/versions/3.10.6/lib/python3.10/urllib/parse.py", line 667, in unquote
    append(unquote_to_bytes(bits[i]).decode(encoding, errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

trim21 · 2024-06-03T17:01:42Z

consider this (...)

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

This is valid, tho simplify it...

>>> r = httpx.get('https://www.example.com?q=%EE', params={'a': '1'})
>>> r.request.url
URL('https://www.example.com?q=%EF%BF%BD&a=1')

Simplify it more...

>>> httpx.QueryParams('q=%EE')
QueryParams('q=%EF%BF%BD')

Eh?

Root cause is...

>>> from urllib.parse import unquote
>>> from urllib.parse import quote
>>> quote(unquote('%EE'))  # stdlib behaving similarly
'%EF%BF%BD'
>>> unquote('%EE')  # The hex code here isn't a valid UTF-8 codepoint, and is being replaced.
'�'
>>> unquote('%EE', errors='strict')  # We'd raise an error if we decoded it with 'strict'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tom.christie/.pyenv/versions/3.10.6/lib/python3.10/urllib/parse.py", line 667, in unquote
    append(unquote_to_bytes(bits[i]).decode(encoding, errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

yes, that's a problem, I think.

if quote_from_bytes(unquote_to_bytes("%EE")) then everything is fine, in this case.

tomchristie · 2024-06-03T17:10:25Z

@cknv Could you give a (simplified) example please?

cknv · 2024-06-03T19:08:45Z

Sure @tomchristie! Get ready for some flashback to the dark ages, before UTF8 won the encoding wars.

from urllib.parse import urlencode
import httpx

client = httpx.Client()

values = {
    "expected-as-utf8": "Hello world",
    "expected-as-int": 1234,
    "expected-as-latin1": "Hello world".encode("latin1"),
    "expected-as-binary": b"\x01\x02\x03\x04",
}

encoded_query = urlencode(values)

# replaced .get with .build_request
# as making requests does not seem important for demonstrating my workaround
correctly_encoded_request = client.build_request(
    method="GET",
    url=f"http://localhost?{encoded_query}",
)
incorrecty_encoded_request = client.build_request(
    method="GET",
    url="http://localhost",
    params=values,
)

# compare and contrast:
print(correctly_encoded_request.url.params)
print(incorrecty_encoded_request.url.params)

When actually rereading the code, it wasn't UTF-16 that was used, but instead actually latin1, so I replaced that in the example, although either of them would need to be encoded into bytes.

Narrowing our focus to just the plain byte value, it should be expected-as-binary=%01%02%03%04, and my workaround indeed does produce that.
Without the workaround and just feeding the values dict into the request builders params the value is reprd (via str) and produces expected-as-binary=b%27%5Cx01%5Cx02%5Cx03%5Cx04%27 which is the percent encoded version of expected-as-binary=b'\x01\x02\x03\x04'.

The latin1 value works without the workaround as long as you stick to the subset that is identical with ASCII, but we do at times get data through that isn't just that subset.

To add some context the server receiving the request is written in C, and has a very lax attitude to encodings because it's all just bytes anyway.

If httpx would percent encode bytes, I would probably be able to remove that workaround. However I would not be surprised if there are other ways to send bytes over http either established in an RFC or just by tradition, if so I would not know which one is the most appropriate, but it appears like percent encoding is at least in the RFC.

Finally, I would like to reiterate that I know that this is way outside the mainstream, I have mostly accepted my workaround, but I at least wanted to raise my hand when someone else expressed a similar problem.

trim21 changed the title ~~add more type checking in headers/params/form-data~~ add more type checking in params/form-data Apr 17, 2024

trim21 mentioned this issue Apr 17, 2024

fix: add type checking in form data #3173

Open

3 tasks

trim21 mentioned this issue May 31, 2024

feat: support bytes as query value #3212

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add more type checking in params/form-data #3170

add more type checking in params/form-data #3170

trim21 commented Apr 17, 2024 •

edited

trim21 commented May 31, 2024 •

edited

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024

trim21 commented Jun 3, 2024 •

edited

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024 •

edited

cknv commented Jun 3, 2024 •

edited

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024 •

edited

tomchristie commented Jun 3, 2024

cknv commented Jun 3, 2024

add more type checking in params/form-data #3170

add more type checking in params/form-data #3170

Comments

trim21 commented Apr 17, 2024 • edited

Problem

Proposal

form

query

multipart_data

trim21 commented May 31, 2024 • edited

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024

trim21 commented Jun 3, 2024 • edited

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024 • edited

cknv commented Jun 3, 2024 • edited

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024 • edited

tomchristie commented Jun 3, 2024

cknv commented Jun 3, 2024

trim21 commented Apr 17, 2024 •

edited

trim21 commented May 31, 2024 •

edited

trim21 commented Jun 3, 2024 •

edited

trim21 commented Jun 3, 2024 •

edited

cknv commented Jun 3, 2024 •

edited

trim21 commented Jun 3, 2024 •

edited