Add support for non UTF-8 json input #75

Open
Tracked by #66
wbprime opened this issue Apr 19, 2024 · 5 comments

Comments

@wbprime

wbprime commented Apr 19, 2024

Is your feature request related to a problem? Please describe.

sonic-rs fails if the input bytes contain non-UTF-8 characters, even with the pub fn from_slice<'a, T>(json: &'a [u8]) function. However, there are cases where JSON bytes that are not UTF-8 encoded need serialize/deserialize support, typically GBK/GB18030-encoded data in China.
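A minimal sketch of the situation, using only the Rust standard library rather than sonic-rs itself. It assumes a JSON document whose string value is GBK-encoded ("中" is the two bytes 0xD6 0xD0 in GBK) and shows that such input is rejected by a plain UTF-8 validity check. The helper names `gbk_json_sample` and `first_invalid_utf8_offset` are hypothetical and exist only for this illustration.

```rust
/// Builds the JSON text {"k":"中"} with the string value encoded as GBK.
/// In GBK, "中" is the two bytes 0xD6 0xD0.
fn gbk_json_sample() -> Vec<u8> {
    let mut bytes = b"{\"k\":\"".to_vec();
    bytes.extend_from_slice(&[0xD6, 0xD0]);
    bytes.extend_from_slice(b"\"}");
    bytes
}

/// Returns the offset of the first byte that breaks UTF-8 validity,
/// or None if the whole input is valid UTF-8.
fn first_invalid_utf8_offset(input: &[u8]) -> Option<usize> {
    std::str::from_utf8(input).err().map(|e| e.valid_up_to())
}

fn main() {
    let input = gbk_json_sample();
    match first_invalid_utf8_offset(&input) {
        Some(offset) => println!("not valid UTF-8: first bad byte at offset {offset}"),
        None => println!("valid UTF-8"),
    }
}
```

The JSON structure around the value is plain ASCII and passes validation; only the GBK-encoded string content is rejected.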

Describe the solution you'd like

  • add support for non-UTF-8 encoded JSON bytes in the from_slice function
  • or drop the from_slice function

Describe alternatives you've considered

N/A.

Additional context
N/A.

@PureWhiteWu
Member

Hello, according to the JSON RFC, Unicode encoding is required.

image

Furthermore, do other JSON libraries such as serde_json or simd_json support non-UTF-8 input?

@wbprime
Author

wbprime commented Apr 24, 2024

@PureWhiteWu sorry for the late reply.

serde_json can deserialize non-UTF-8 bytes. I have not tested simd_json.

I understand your design principle of adhering to the JSON standard. However, UTF-8 is not the only encoding of Unicode. If, say, UTF-16 support is on your roadmap, then support for other, non-Unicode encodings could probably be added with little extra effort.

Moreover, the JSON standard suggests that supporting non-UTF-8 encodings is permitted as an implementation extension.

One last point: GBK/GB18030, much like UTF-8, stays compatible with ASCII, which should make it easy to support.

Thanks

@liuq19
Collaborator

liuq19 commented Apr 25, 2024

Thanks, could you give a test case with code? As far as I know, serde_json accepts invalid UTF-8 only when deserializing into bytes.

@wbprime
Author

wbprime commented Apr 26, 2024

@liuq19 See this repository for your convenience.

@liuq19
Collaborator

liuq19 commented Apr 26, 2024

Thanks, we will investigate it.

@liuq19 liuq19 mentioned this issue May 23, 2024