JSON parser breaks in presence of UTF BOM #9849

Open
radeusgd opened this issue May 2, 2024 · 1 comment

radeusgd commented May 2, 2024

I just tried reading a JSON file that I wrote using PowerShell and got the following failure:

(Error: (Corrupted_Format (File_Like.Value (File foo.json)) 'Parse error in parsing JSON: Unexpected character (\'\' (code 65279 / 0xfeff)): expected a valid value (JSON String, Number, Array, Object or token \'null\', \'true\' or \'false\') at position [line: 1, column: 2].' (Invalid_JSON.Error 'Unexpected character (\'\' (code 65279 / 0xfeff)): expected a valid value (JSON String, Number, Array, Object or token \'null\', \'true\' or \'false\') at position [line: 1, column: 2]')))

The 0xfeff character seems to be the problem.

I imagine we need to strip whitespace before parsing.

The smallest repro for the bug is:

'\ufeff{}'.parse_json

I think Enso should be able to handle files encoded with BOM.
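For comparison, here is a minimal Python sketch of the same failure and of one possible workaround: dropping a single leading U+FEFF before parsing. It is not Enso code and does not show how Enso's parser works; the helper name `parse_json_lenient` is hypothetical and only illustrates the idea.

```python
import json

BOM = "\ufeff"  # U+FEFF, code point 65279 as in the error above


def parse_json_lenient(text):
    """Parse JSON text, ignoring one leading byte order mark (hypothetical helper)."""
    if text.startswith(BOM):
        text = text[len(BOM):]
    return json.loads(text)


# Python's strict parser also rejects a leading BOM in a str input.
try:
    json.loads(BOM + "{}")
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

print(strict_ok)                       # whether the strict parse succeeded
print(parse_json_lenient(BOM + "{}"))  # parsed value after dropping the BOM
```

Stripping only the BOM, rather than arbitrary leading characters, keeps the parser strict about everything else in the input.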

We may want to revise how our CSV parser handles such cases too.

Changes

By default, we should check for a BOM when reading text from a stream.
Our default "charset" should be smarter and use the BOM to determine which Unicode encoding to use.
If the default "charset" encounters an invalid character, we should fall back to Windows-1252.
This should apply when reading plain text, delimited tables, JSON and XML.
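The behaviour proposed above could be sketched roughly as follows, in Python rather than Enso. The function name `decode_default` is hypothetical, and two details are assumptions not stated in this issue: input without a BOM is tried as UTF-8 first, and bytes that cannot be decoded are replaced rather than raising.

```python
import codecs

# (BOM bytes, codec name). UTF-32 LE is checked before UTF-16 LE because
# its BOM begins with the same two bytes (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_default(data):
    """Return (text, codec_name) for raw bytes (hypothetical helper)."""
    # 1. A BOM, if present, decides the Unicode encoding and is dropped.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name, errors="replace"), name
    # 2. Assumption: without a BOM, try strict UTF-8 first.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # 3. Invalid UTF-8: fall back to Windows-1252 (cp1252 in Python).
        #    A few bytes are undefined there, hence errors="replace".
        return data.decode("cp1252", errors="replace"), "cp1252"
```

For example, `decode_default(b"\xef\xbb\xbf{}")` yields the text `{}` without the BOM, while `decode_default(b"caf\xe9")` is not valid UTF-8 and is decoded as Windows-1252.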

The error reporting should list only a few of the failing indexes (three) and a count of the total failures.
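One possible shape for such a message, as a Python sketch. The function name `summarize_failures`, the message wording, and the choice to show the first three indexes are all assumptions for illustration, not the format Enso uses.

```python
def summarize_failures(indexes, limit=3):
    """Build a short report from a list of failing indexes (hypothetical helper)."""
    total = len(indexes)
    if total == 0:
        return "No encoding issues."
    # Show at most `limit` indexes, then summarize the rest as a count.
    shown = ", ".join(str(i) for i in indexes[:limit])
    if total > limit:
        return f"Encoding issues at: {shown} and {total - limit} more ({total} total)."
    return f"Encoding issues at: {shown} ({total} total)."


print(summarize_failures([4, 9, 15, 22, 30]))
```

Capping the list keeps the message readable for files with many invalid characters, while the total still conveys the scale of the problem.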

@radeusgd radeusgd added --bug Type: bug l-readdata -libs Libraries: New libraries to be implemented labels May 2, 2024
@jdunkerley jdunkerley assigned radeusgd and unassigned jdunkerley May 14, 2024
@jdunkerley jdunkerley added this to the Beta Release milestone May 14, 2024
@enso-bot enso-bot bot mentioned this issue May 22, 2024

enso-bot bot commented May 23, 2024

Radosław Waśko reports a new STANDUP for yesterday (2024-05-22):

Progress: Fixed the failing test and got the union PR merged. Introduced Encoding.Default and set it as default for relevant read operations. Added tests for BOM handling and detection heuristics. It should be finished by 2024-05-28.

Next Day: Next day I will be working on the same task. Work on BOM detection. Investigate Datalinks issue. Types work.

Projects
Status: 🔧 Implementation