feature - avoid utf-8 decoding for text frames #1376

toppk · 2023-06-26T05:02:06Z

Just because the payload is supposed to be UTF-8 doesn't mean I want it decoded into a str. Specifically, my use case is handing the data to orjson and passing it around as an orjson.Fragment().

Here are the docs for that use case:

https://github.com/ijl/orjson#deserialize
https://github.com/ijl/orjson#fragment
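
To make the motivation concrete, here is a minimal sketch of the round trip being avoided. It uses the standard library's json module as a stand-in for orjson (both accept bytes as input); the payload is made up for illustration.

```python
import json

# Payload of a Text frame as it arrives on the wire: UTF-8 encoded bytes.
payload = '{"user": "zoë", "id": 7}'.encode("utf-8")

# Current path: the library decodes to str, and a bytes-oriented consumer
# then has to encode the very same data again.
as_text = payload.decode("utf-8")
round_tripped = as_text.encode("utf-8")

# Requested path: hand the original bytes straight to the parser.
parsed_from_bytes = json.loads(payload)
parsed_from_text = json.loads(as_text)
```

Both paths parse to the same object, so the decode and re-encode steps are pure overhead for this kind of consumer.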

Looking at the websockets code, if such a capability were implemented, it seems we'd want to add a flag to WebSocketCommonProtocol() and use it to force binary handling at the point where it decides whether to decode the payload, located here:

https://github.com/python-websockets/websockets/blob/main/src/websockets/legacy/protocol.py#L1053

I'd be happy to whip up a patch in case you would consider this feature request.

aaugustin · 2023-06-26T10:07:12Z

I understand your use case and, indeed, you cannot do this with the current API.

For receiving frames, it would mean an API like websocket.recv(decode_text_frames=False). (Naming TBC.) Can you confirm that it's what you want? (Then you get bytes in all cases, so you cannot tell whether it was a Text or Binary frame in the first place; but you don't really care anyway.)
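
A rough sketch of the receive-side behavior, just to pin down the semantics. deliver_message is a hypothetical helper, not part of websockets, and decode_text_frames is the tentative name from above; the opcode values are the ones defined in RFC 6455.

```python
from typing import Union

OP_TEXT = 0x1    # Text frame opcode (RFC 6455)
OP_BINARY = 0x2  # Binary frame opcode (RFC 6455)


def deliver_message(
    opcode: int, payload: bytes, decode_text_frames: bool = True
) -> Union[str, bytes]:
    """Hypothetical helper: decide what the caller of recv() would get."""
    if opcode == OP_TEXT and decode_text_frames:
        # Default behavior: Text frames are decoded to str.
        return payload.decode("utf-8")
    # Binary frames, or Text frames with decoding disabled, stay as bytes.
    return payload
```

With decode_text_frames=False the return type is always bytes, which is exactly why the Text/Binary distinction is lost for the caller.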

This raises the question of providing a symmetrical API for sending bytes (assumed to be valid UTF-8) as a Text frame. You didn't ask for this but I'd like to keep consistency between both sides.

toppk · 2023-06-26T19:02:27Z

That would work quite well. I guess I misunderstood the code, because it looked to me as if the recv() method was decoupled from where the inbound data is actually processed (read_message()). The solution you propose would certainly be more flexible.

toppk · 2023-06-28T02:48:11Z

Just thinking about the send side: I think it really is less important. There aren't many servers that are strict about what they accept, especially when they are expecting text. If we implement it for send, the effect is the same (skip encode, skip decode), but the option names will differ, e.g. decode_text_frames=False for recv() and send_as_text=True for send().
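
For the send side, something along these lines is what I have in mind. prepare_frame is a hypothetical helper and send_as_text is only the placeholder name from this comment, neither exists in websockets today. The bytes are assumed to be valid UTF-8 and are not checked.

```python
from typing import Tuple, Union

OP_TEXT = 0x1    # Text frame opcode (RFC 6455)
OP_BINARY = 0x2  # Binary frame opcode (RFC 6455)


def prepare_frame(
    data: Union[str, bytes], send_as_text: bool = False
) -> Tuple[int, bytes]:
    """Hypothetical helper: pick the opcode and payload for an outgoing message."""
    if isinstance(data, str):
        # Default behavior: str is encoded and sent as a Text frame.
        return OP_TEXT, data.encode("utf-8")
    if send_as_text:
        # Caller vouches that the bytes are valid UTF-8; skip the encode step.
        return OP_TEXT, bytes(data)
    # Default behavior for bytes: send as a Binary frame.
    return OP_BINARY, bytes(data)
```

So prepare_frame(orjson_bytes, send_as_text=True) would put already-encoded JSON on the wire as a Text frame without going through str.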

aaugustin · 2023-06-28T07:03:36Z

Yes, we need to pick the names for both sides carefully and, ideally, consistently.

raw_utf8 is a name that could work for both sides. I'm not sure it's the best name we can find, though.

If we have two names, I'd like some symmetry e.g. using the words decode and encode.

carlos-sarmiento · 2023-09-20T17:52:41Z

I'm finding myself in the same position, trying to send data encoded with orjson as a text frame even when it is provided to websockets in binary form.

Any chance this gets added?

aaugustin added the enhancement label Jun 26, 2023

aaugustin added the high priority label Sep 22, 2023
