Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r.text and r.json() return different results *in some cases* #5667

Closed
jjmaldonis opened this issue Nov 24, 2020 · 3 comments · Fixed by #5673
Closed

r.text and r.json() return different results *in some cases* #5667

jjmaldonis opened this issue Nov 24, 2020 · 3 comments · Fixed by #5673

Comments

@jjmaldonis
Copy link
Contributor

This might be an interesting one... I found that r.text and r.json() can return different results in some specific cases. I don't understand why the difference between the two cases is changing the result of r.text.

Maybe r.text should always default to using utf-8 as the decoding if application/json is set as the response's content type, following https://www.ietf.org/rfc/rfc4627.txt (ctrl+f for JSON text SHALL be encoded in Unicode.).

I'm using the latest version of requests.

Expected Result

I would expect r.text and r.json() to return ~ the same thing. More specifically, I would expect json.loads(r.text) and r.json() to return the same thing, but the issue seems to be with r.text's decoding specifically.

Actual Result

In the following code, I am making a sample request and replacing the request's response with custom content so we have full control over it. The custom content is utf-8 encoded. In the next version of this code, you'll see the name change when it shouldn't.

import requests
import json

r = requests.get("https://api.covidtracking.com/v1/us/current.json")
r._content = b'{"name":"rd\xce\xba"}'  # This is utf-8
r.headers = {
    "Content-Type": "application/json",
}
print(r.json())
print(json.loads(r.text))

The above code prints:

{'name': 'rdκ'}
{'name': 'rdκ'}

which is fantastic.

Replacing the request's content with b'{"name":"rd\xce\xba","uuid":"1234"}', which simply adds a uuid field to the JSON, and running the code again prints:

{'name': 'rdκ', 'uuid': '1234'}
{'name': 'rdκ', 'uuid': '1234'}

The name is different even though it did not change at all! The existence of "uuid":"1234" in the response's contents somehow changes the decoding. I have no clue why.

Reproduction Steps

Run this code:

import requests
import json

r = requests.get("https://api.covidtracking.com/v1/us/current.json")
r._content = b'{"name":"rd\xce\xba","uuid":"1234"}'
r.headers = {
    "Content-Type": "application/json",
}
print(r.json())
print(json.loads(r.text))

The issue should be fixed when the two print statements match... I think.

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.7"
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.4"
  },
  "platform": {
    "release": "10",
    "system": "Windows"
  },
  "pyOpenSSL": {
    "openssl_version": "1010103f",
    "version": "19.0.0"
  },
  "requests": {
    "version": "2.25.0"
  },
  "system_ssl": {
    "version": "1010104f"
  },
  "urllib3": {
    "version": "1.24.2"
  },
  "using_pyopenssl": true
}
@jjmaldonis
Copy link
Contributor Author

jjmaldonis commented Nov 24, 2020

I looked into this some more and the reason for the difference is that chardet is guessing different text encodings for the two different strings.

I propose that, based on RFC 4627, if the response's content type is set to application/json that utf be used as the encoding by default rather than using chardet to guess the encoding. You already have a guess_json_utf function that returns the utf type (utf-8, -16, or -32) and I believe you can just reuse that: encoding = guess_json_utf(self.content).

I'm happy to submit a PR for this if you want.

RFC 4627: https://www.ietf.org/rfc/rfc4627.txt -- Ctrl+f for JSON text SHALL be encoded in Unicode.

@sigmavirus24
Copy link
Contributor

So there's a semantic difference between .json() and .text. For one thing, I'd love for .text to not be a thing at all. Most of the time we guess wrong. If we dropped .text we wouldn't need chardet and we could finally end the ridiculous arguments about LGPL being involved in Requests.

Also, given the fact that .text does things while decoding bytes (like replace invalid sequences), I'd argue it should be avoided at all costs due to the likelihood of data corruption.

All this said, get_encoding_from_headers could probably smarter about application/json

jjmaldonis added a commit to jjmaldonis/requests that referenced this issue Nov 29, 2020
…pe is set to application/json, following RFC 4627.

fixes psf#5667
@jjmaldonis
Copy link
Contributor Author

That's too bad about the issues with .text. If only it were possible to drop stuff like that. I totally agree with where you're coming from.

I just made a small PR based on what you said, it's #5673. I'm happy to make any changes you want.

laggardkernel added a commit to laggardkernel/requests that referenced this issue Mar 20, 2021
New encoding priority

1. Response.encoding from charset
2. Response.apparent_encoding
    1. using guess strategy for specific content-type
    2. general guess strategy by chardet.detect

another fix for psf#5667
laggardkernel added a commit to laggardkernel/requests that referenced this issue Mar 20, 2021
New encoding priority

1. Response.encoding from charset
2. Response.guess_and_decode()
    1. using guess strategy for specific content-type
    2. general guess strategy by chardet.detect

another fix for psf#5667
jonfung-scale pushed a commit to jonfung-scale/jonfung-requests that referenced this issue Jul 14, 2021
…pe is set to application/json, following RFC 4627.

fixes psf#5667
jonfung-scale pushed a commit to jonfung-scale/jonfung-requests that referenced this issue Jul 14, 2021
…pe is set to application/json, following RFC 4627.

fixes psf#5667
jonfung-scale pushed a commit to jonfung-scale/jonfung-requests that referenced this issue Jul 15, 2021
…pe is set to application/json, following RFC 4627.

fixes psf#5667
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 28, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants