Bad UTF-8 "To" header encoding? #369

andresmrm · 2024-04-12T19:46:23Z

Hi!

I'm getting errors when trying to send messages when the "to" header has non-ASCII chars.
The problem seems to happen when all these conditions are true at the same time:

- the readable name part of the recipient is long (e.g. "A very long and big name for this recipient" to@example.com)
- there is at least one non-ASCII char
- there is at least one "special char" (only seems to happen with a comma or parentheses)
- the special char isn't close to or between non-ASCII chars

I will use the related test to better explain:

| Description | Input | Output | Error? |
| --- | --- | --- | --- |
| Current test | `"Người nhận" <to@example.com>` | `To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= <to@example.com>` | No |
| Long | `"Người nhận a very very long name" <to@example.com>` | `To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long name <to@example.com>` | No |
| With comma | `"Người nhận a very very long, name" <to@example.com>` | `To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com>` | YES |
| Comma between non-ASCII | `"Người nhận a very very long, náme" <to@example.com>` | `To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1uIGEgdmVyeSB2ZXJ5IGxvbmcsIG7DoW1l?=\n <to@example.com>` | No |
| Comma near non-ASCII | `"Người nhận, name" <to@example.com>` | `To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1uLCBuYW1l?= <to@example.com>` | No |

So, if we have a UTF-8 encoded part and a special char, the special char must also be encoded, but currently this only happens if the special char is close to or between non-ASCII chars.
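To see what the "With comma" row leaves exposed, it can help to decode the generated header value back to text. The following is a minimal sketch using only Python's standard library; the helper name `decode_header_value` is made up for illustration and is not part of Anymail or Django.

```python
from email.header import decode_header, make_header


def decode_header_value(raw: str) -> str:
    """Decode RFC 2047 encoded words in a header value back to text."""
    return str(make_header(decode_header(raw)))


# Output of the "Current test" row (without the "To: " prefix).
ok = "=?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= <to@example.com>"
# Output of the "With comma" row.
bad = "=?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com>"

decoded_ok = decode_header_value(ok)
decoded_bad = decode_header_value(bad)

# In `bad`, the comma sits outside any encoded word or quoted string,
# so a receiving parser can read it as a separator between two addresses.
comma_is_exposed = "," in bad
```

In the rows that do not error, the comma is either absent or carried inside the encoded word, which is why only the "With comma" case is rejected.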

Commenting out this line seems to encode the entire name in these cases, solving the problem. But I don't know if this has other unwanted consequences.

Anymail version: 10.3
ESP: Amazon SES
Versions: Django 5.0.3, requests 2.30.0, Python 3.11.8
Exact error message and/or stack trace

Example traceback:

File ~/.local/lib/python3.11/site-packages/anymail/backends/amazon_ses.py:78, in EmailBackend._send(self, message)
     76 def _send(self, message):
     77     if self.client:
---> 78         return super()._send(message)
     79     elif self.fail_silently:
     80         # (Probably missing boto3 credentials in open().)
     81         return False

File ~/.local/lib/python3.11/site-packages/anymail/backends/base.py:147, in AnymailBaseBackend._send(self, message)
    144     return False
    146 payload = self.build_message_payload(message, self.send_defaults)
--> 147 response = self.post_to_esp(payload, message)
    148 message.anymail_status.esp_response = response
    150 recipient_status = self.parse_recipient_status(response, payload, message)

File ~/.local/lib/python3.11/site-packages/anymail/backends/amazon_ses.py:114, in EmailBackend.post_to_esp(self, payload, message)
    110     response = client_send_api(**payload.params)
    111 except BOTO_BASE_ERRORS as err:
    112     # ClientError has a response attr with parsed json error response
    113     # (other errors don't)
--> 114     raise AnymailAPIError(
    115         str(err),
    116         backend=self,
    117         email_message=message,
    118         payload=payload,
    119         response=getattr(err, "response", None),
    120     ) from err
    121 return response

AnymailAPIError: An error occurred (BadRequestException) when calling the SendEmail operation: Local address contains control or whitespace
botocore.errorfactory.BadRequestException: An error occurred (BadRequestException) when calling the SendEmail operation: Local address contains control or whitespace


medmunds · 2024-04-12T23:15:21Z

Thanks for the report and the detailed analysis. I'm able to reproduce this and am investigating. [I hope you don't mind, I edited your report to format the input and output columns as code, because GitHub was hiding important characters essential to understanding the problem.]

It looks like you've uncovered a bug in Python's email package, related to incorrectly "folding" address headers that are too long, when using Content-Transfer-Encoding (CTE) 7bit, and if the headers need "encoded words" and also contain "special characters." I haven't been able to locate a Python bug report for this exact problem, but a similar issue with shorter address headers that didn't require folding (python/cpython#81663) was fixed in Python 3.8. I'm guessing they missed the folding case.

Anymail's SES backend needs to use CTE 7bit (the line of code you identified), because Amazon SES doesn't officially support 8bit CTE, and using it can result in mojibake depending on what other SES options are enabled. (See the comments above that code and Anymail issue #115.)

I'll look into workarounds…

- It seems like Amazon has relaxed their stance on 8bit "in some cases," though still recommends 7bit for anything non-ASCII. We could add an Anymail option to choose 8bit CTE (and risk mojibake) or 7bit CTE (and risk broken address headers).
  Also, I'm not sure how widespread support is for 8bit email. (Gmail handles it fine, but do you have any recipients still using Outlook 2013? Or some ancient enterprise email gateway appliance?)
- Allowing longer lines would solve the problem (no folding needed), but has similar compatibility concerns to using 8bit (or worse).
- If we can come up with a fix for the Python email package, Anymail might be able to include that code and use it to override broken address header serialization. (This would be my preference.)
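As a rough illustration of the "allow longer lines" option, here is a sketch that uses Python's modern `email.message.EmailMessage` with a cloned policy. It is not Anymail's code and does not go through Django; it only shows that the standard library's documented `max_line_length` policy setting can switch off header folding (a value of `None` or `0` means no wrapping).

```python
import email.message
import email.policy

to = '"Người nhận a very very long, name" <to@example.com>'

# 7bit CTE as discussed above, but with line wrapping disabled.
no_fold_policy = email.policy.default.clone(
    cte_type="7bit",
    max_line_length=None,  # None (or 0) means: do not wrap lines at all
)

msg = email.message.EmailMessage(policy=no_fold_policy)
msg["To"] = to

# The serialized message is pure ASCII, so this decode is safe.
serialized = msg.as_bytes().decode("ascii")
print(serialized)
```

The compatibility concern from the list still applies: lines longer than the usual 78 characters may not be handled well by every mail system.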

andresmrm · 2024-04-13T11:01:58Z

Thanks for the quick and complete response, and for improving the table in my post! I'll adapt the test case so it can be reproduced without Anymail and report it to CPython. About the workarounds, it seemed hard to fix the issue without monkey patching, something I would like to avoid since sending email is a crucial part of my application. So, for now, I was just going to remove commas and parentheses from the addresses...
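The stopgap of removing commas and parentheses could look roughly like the sketch below. It relies only on the standard library's `email.utils.parseaddr` and `email.utils.quote`; the function name `strip_display_name_specials` and the choice to replace the characters with a space are assumptions made for illustration.

```python
import re
from email.utils import parseaddr, quote

# The characters reported in this issue as triggering the bad header.
_TROUBLE_CHARS = re.compile(r"[,()]")


def strip_display_name_specials(address: str) -> str:
    """Return `address` with commas and parentheses removed from its display name."""
    name, addr_spec = parseaddr(address)
    if not name:
        # No display name (or unparseable input): nothing to clean.
        return addr_spec or address
    cleaned = _TROUBLE_CHARS.sub(" ", name)
    cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
    if not cleaned:
        return addr_spec
    # quote() escapes any backslashes or double quotes left in the name.
    return f'"{quote(cleaned)}" <{addr_spec}>'


cleaned_to = strip_display_name_specials(
    '"Người nhận a very very long, name" <to@example.com>'
)
```

The cleaned value can then be passed as the recipient in place of the original string. This changes what recipients see as the display name, so it is only a temporary measure.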

medmunds · 2024-04-13T18:56:04Z

Hmm, looking into this some more, I think it's actually a Django bug. And there's a reasonable workaround Anymail could implement.

Python's email.message.EmailMessage class handles the header correctly. But I see the bug when using django.core.mail.EmailMessage:

import email.message
import email.policy
import django.core.mail

to = '"Người nhận a very very long, name" <to@example.com>'
policy = email.policy.default.clone(cte_type="7bit")

# Python's EmailMessage class doesn't exhibit bug
msg1 = email.message.EmailMessage()
msg1["To"] = to
print(msg1.as_bytes(policy=policy).decode("ascii"))
# To:
#   =?utf-8?b?TmfGsOG7nWkgbmjhuq1uIGEgdmVyeSB2ZXJ5IGxvbmcs?= name <to@example.com>

# Django's EmailMessage class has bug
msg2 = django.core.mail.EmailMessage(to=[to]).message()
msg2.policy = policy
print(msg2.as_bytes().decode("ascii"))
# [... other headers ...]
# To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com>

Django's EmailMessage.message() builds a Python legacy compatibility email.message.Message (wrapped as a Django SafeMIMEText). I believe the problem starts when django.core.mail.message.sanitize_address calls Header().encode() without specifying a maxlinelen for the header. This results in a single, very long encoded word for the entire display name. Python tries to refold it while serializing the message, but bugs in the legacy code introduce the error. (I think.) (So technically, this may be a Python bug, but it's in legacy code that won't be fixed. Also, it's probably related to django#31784.)

I think Anymail should just stop using Django's EmailMessage.message(), and instead build its own modern email.message.EmailMessage object directly from the Django EmailMessage. There are a lot of problems in the old Python Message code that were fixed in Python ~3.3-3.5's email revamp. (I also think Django should update to the newer class, too, but that's a different discussion.)
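A very rough sketch of that idea, using only Python's standard library, might look like this. The helper `build_modern_message` and its parameters are hypothetical; a real implementation would also need to carry over cc/bcc, extra headers, alternatives, and attachments from the Django message, none of which is shown here.

```python
import email.message
import email.policy
from typing import Iterable

# Assumption: 7bit CTE, matching what the thread says the SES backend needs.
SEVEN_BIT_POLICY = email.policy.default.clone(cte_type="7bit")


def build_modern_message(
    subject: str, body: str, from_email: str, to: Iterable[str]
) -> email.message.EmailMessage:
    """Build a modern EmailMessage from plain fields (hypothetical helper)."""
    msg = email.message.EmailMessage(policy=SEVEN_BIT_POLICY)
    msg["Subject"] = subject
    msg["From"] = from_email
    msg["To"] = ", ".join(to)  # the header machinery parses the address list
    msg.set_content(body)
    return msg


msg = build_modern_message(
    subject="Xin chào",
    body="Nội dung thử nghiệm\n",
    from_email="from@example.com",
    to=['"Người nhận a very very long, name" <to@example.com>'],
)
raw = msg.as_bytes()  # headers and body are serialized as 7bit-safe ASCII
```

Because the message never passes through the legacy `Message` class, header encoding and folding are handled in one place by the modern policy.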

Before switching, we'll need to investigate whether there's anything important in Django's SafeMIME classes we'd be losing or need to copy. I suspect a lot of SafeMIME is there solely to work around bugs and security concerns in Python's legacy Message code—issues that don't apply to Python's modern EmailMessage. But I haven't really looked through that part of Django's mail package in a while.

andresmrm · 2024-04-15T12:27:54Z

Last Saturday I couldn't reproduce the bug using only Python's email package. So I went back to trying to isolate it, and it seems to be in sanitize_address (as you said). But I didn't have time to finish my analysis and report back here...

I've investigated a bit more now, and it really does seem to be a conflict between Django's and Python's behavior, as you said.
Django's forbid_multi_line_headers (sanitize_address is called by it) already does the encoding. This:

"A véry long name with non-ASCII char and, comma"

Becomes this:

=?utf-8?q?A_v=C3=A9ry_long_name_with_non-ASCII_char_and=2C_comma?=

But later, generator.BytesGenerator.flatten reencodes it wrongly:

A =?utf-8?q?v=C3=A9ry?= long name with non-ASCII char and, comma

If we comment out the forbid_multi_line_headers line, letting flatten handle the original string, it gives this:

A =?utf-8?q?v=C3=A9ry_long_name_with_non-ASCII_char_and=2C?= comma

Which seems valid. Even if I greatly increase the length of the string, so that the non-ASCII char and the comma end up on different lines, it still works:

A =?utf-8?q?v=C3=A9ry?= long long long long long long long long long long\n long long long long long long long long long long long long long long long\n long long long long name with non-ASCII char =?utf-8?q?and=2C?= comma

So the problem only happens when both functions try to encode the string.
That said, they encode it a bit differently: the first encodes the entire name, the second only the required parts. Not sure if this is relevant.
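The "encode the entire name" behavior can be observed in isolation with the legacy `email.header.Header` class, which is the call mentioned earlier in the thread. This is a sketch of that single step only, not Django's actual code path, and the exact encoded text may vary, so it checks properties rather than a literal string.

```python
from email.header import Header, decode_header, make_header

name = "A véry long name with non-ASCII char and, comma"

# Encode the whole display name up front with the legacy Header class.
pre_encoded = Header(name, "utf-8").encode()

# The comma ends up inside the encoded word, so on its own this value
# cannot be mistaken for two addresses.
assert "," not in pre_encoded

# Decoding restores the original text.
assert str(make_header(decode_header(pre_encoded))) == name
```

On its own this pre-encoded value is fine; per the analysis above, the trouble starts when the serializer later re-encodes it.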

Should I report it both at Python and Django?

medmunds · 2024-04-15T23:12:41Z

> Should I report it both at Python and Django?

I'd report it to Django only. I'm pretty sure the problem only occurs with Python email's legacy email.message.Message, which uses a different folding algorithm (via email.policy.compat32) than modern email.message.EmailMessage (email.policy.default) uses.

My understanding is the Compat32 legacy layer is there specifically to replicate Python 2's email behavior (including any bugs), so there's not much point in reporting bugs against it.
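For reference, the default policy attached to each class can be checked directly in the standard library:

```python
import email.message
import email.policy

legacy = email.message.Message()       # legacy API
modern = email.message.EmailMessage()  # modern API

# Each class picks a different default policy, and with it a
# different header folding implementation.
assert legacy.policy is email.policy.compat32
assert modern.policy is email.policy.default
```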

andresmrm · 2024-04-16T14:51:33Z

Done! Feel free to add more info there.

medmunds added the bug, esp:Amazon SES, and not our bug (Bug, but in ESP or third-party code) labels on Apr 12, 2024

sarahboyce mentioned this issue Apr 30, 2024

Fixed #35378 -- Added maximum header length for long addresses. django/django#18110

