JSON parser breaks in presence of UTF BOM #9849

radeusgd · 2024-05-02T17:34:01Z

I just tried reading a JSON file that I wrote using PowerShell and got the following failure:

(Error: (Corrupted_Format (File_Like.Value (File foo.json)) 'Parse error in parsing JSON: Unexpected character (\'\' (code 65279 / 0xfeff)): expected a valid value (JSON String, Number, Array, Object or token \'null\', \'true\' or \'false\') at position [line: 1, column: 2].' (Invalid_JSON.Error 'Unexpected character (\'\' (code 65279 / 0xfeff)): expected a valid value (JSON String, Number, Array, Object or token \'null\', \'true\' or \'false\') at position [line: 1, column: 2]')))

The 0xfeff character seems to be the problem.

I imagine we need to strip whitespace before parsing.

The smallest repro for the bug is:

'\ufeff{}'.parse_json

I think Enso should be able to handle files encoded with BOM.

We may want to revise how our CSV parser handles such cases too.
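
For comparison, the same failure mode and the simplest workaround can be sketched outside Enso. This is only an illustration in Python, not Enso's implementation: Python's standard `json` module also rejects a string that starts with U+FEFF, and the helper names `parse_json_tolerating_bom` and `strict_parse_fails` are made up for this example.

```python
import json

BOM = "\ufeff"  # code point 65279 / 0xfeff, as in the error message above


def parse_json_tolerating_bom(text: str):
    """Drop one leading U+FEFF, if present, then parse as usual."""
    if text.startswith(BOM):
        text = text[len(BOM):]
    return json.loads(text)


def strict_parse_fails(text: str) -> bool:
    """Return True if the unmodified text is rejected by json.loads."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return True
    return False
```

Stripping the character after decoding only helps when the bytes were already decoded with the right charset, which is why the changes listed below handle the BOM at the decoding stage instead.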

Changes

By default, we should read the BOM when reading text in from a stream.
Our default "charset" should be cleverer and use the BOM to determine which Unicode charset to use.
If the default "charset" encounters an invalid character then we should fall back to Windows-1252.
This should be the case for reading plain text, delimited tables, JSON and XML.
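
A rough sketch of the decoding rule described above, written in Python purely for illustration. It is not the Enso implementation, and it makes two assumptions the issue does not state: that UTF-8 is tried when no BOM is present, and that bytes undefined in Windows-1252 are replaced rather than reported. The function name `decode_with_default_encoding` is hypothetical.

```python
import codecs

# BOM prefixes, checked longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_default_encoding(data: bytes) -> str:
    """Pick a Unicode charset from the BOM; otherwise try UTF-8 (assumed)
    and fall back to Windows-1252 if the bytes are not valid UTF-8."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # The BOM selects the charset and is not part of the text.
            return data[len(bom):].decode(name)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: bytes undefined in Windows-1252 become U+FFFD.
        return data.decode("cp1252", errors="replace")
```

With this rule, the PowerShell-written file from the report decodes to text that starts with `{` rather than with U+FEFF, so the JSON, delimited and XML readers never see the mark.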

The error reporting should report only a few (3) of the failing indexes, plus a count of the total failures.
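
One possible shape for such a message, sketched in Python. The wording of the message and the name `summarize_failures` are assumptions made for illustration; the issue only fixes the number of reported indexes (3) and the presence of a total count.

```python
def summarize_failures(indexes, limit=3):
    """Format at most `limit` failing indexes plus the total count."""
    indexes = list(indexes)
    shown = ", ".join(str(i) for i in indexes[:limit])
    # Signal that more failures exist than are listed.
    suffix = ", ..." if len(indexes) > limit else ""
    return f"Encoding issues at positions {shown}{suffix} ({len(indexes)} in total)."
```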

enso-bot · 2024-05-23T08:31:35Z

Radosław Waśko reports a new STANDUP for yesterday (2024-05-22):

Progress: Fixed the failing test and got the union PR merged. Introduced Encoding.Default and set it as default for relevant read operations. Added tests for BOM handling and detection heuristics. It should be finished by 2024-05-28.

Next Day: Next day I will be working on the same task. Work on BOM detection. Investigate Datalinks issue. Types work.

radeusgd added the --bug (Type: bug), l-readdata, and -libs (Libraries: New libraries to be implemented) labels May 2, 2024

AdRiley assigned jdunkerley May 7, 2024

jdunkerley assigned radeusgd and unassigned jdunkerley May 14, 2024

jdunkerley added this to the Beta Release milestone May 14, 2024

enso-bot bot mentioned this issue May 22, 2024

Update Table.union #9952

Closed

enso-bot bot mentioned this issue May 27, 2024

Prototype for type inference IR pass #8590

Open

6 tasks
