Fix `trimValueWhitespaces` removing needed white-spaces #226

MartinP7r · 2021-08-14T07:52:01Z

Hello everyone!

I found that the logic for trimming white-spaces (trimValueWhitespaces) may inadvertently remove white-spaces from inside the string value of the element, if XMLParser reports the content of a single element as multiple strings.

This happens for example with non-ascii characters like German umlauts:
Lösen is split up and reported separately as L and ösen.
Or where we have "mixed" strings, e.g. containing a segment of Japanese or Chinese characters:
Japanese 日本語 is split into Japanese and 日本語.

If there's a white-space adjacent to the split, as in the Japanese example above, the white-space gets trimmed before it's joined together. Resulting in Japanese日本語 without a space in the middle.

I added failing tests with the first commit.

see also: https://stackoverflow.com/a/1076951/2064473

The parser object may send the delegate several parser:foundCharacters: messages to report the characters of an element. Because string may be only part of the total character content for the current element, you should append it to the current accumulation of characters until the element changes.

see: https://stackoverflow.com/a/1076951/2064473 "The parser object may send the delegate several parser:foundCharacters: messages to report the characters of an element. Because string **may be only part of the total character content for the current element**, you should append it to the current accumulation of characters until the element changes."

MaxDesiatov · 2021-08-14T09:35:22Z

@wooj2 Since you've introduced this in the first place, WDYT about this change?

MartinP7r · 2021-08-16T00:31:25Z

Thank you for having a look, @MaxDesiatov .

I have an idea for a patch, but will wait on @wooj2 to weigh in before I spend time on it.

Refactor `process()` function name to `trimWhitespacesIfNeeded()`

MartinP7r · 2021-08-20T04:22:32Z

I added a possible fix which seems to be not to intrusive.

It seems like a commit in #92 caused the change and added process(string:) in parser(_:foundCharacters:).
I kept it there for the relevant check for emptiness only and moved the actual trimming into parser(_:didEndElement:namespaceURI:qualifiedName:) where it's sure that no further strings will be appended.

Tests are passing, but please let me know if you can think of possible regressions.

codecov · 2021-08-20T04:26:44Z

Codecov Report

Merging #226 (48393da) into main (887de88) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #226      +/-   ##
==========================================
+ Coverage   73.94%   74.00%   +0.06%     
==========================================
  Files          46       46              
  Lines        2437     2443       +6     
==========================================
+ Hits         1802     1808       +6     
  Misses        635      635

Impacted Files	Coverage Δ
Sources/XMLCoder/Auxiliaries/XMLCoderElement.swift	`96.55% <100.00%> (+0.06%)`	⬆️
Sources/XMLCoder/Auxiliaries/XMLStackParser.swift	`94.05% <100.00%> (+0.12%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 887de88...48393da. Read the comment docs.

wooj2 · 2021-08-20T21:07:01Z

Thank you for having a look, @MaxDesiatov .

I have an idea for a patch, but will wait on @wooj2 to weigh in before I spend time on it.

Sorry for the delay. There was some confusion around trimValueWhitespaces in that I was mistakenly identified as the one who introduced this flag. That is not the case :).

If I understand the problem correctly, it seems that in the case of "Stilles Örtchen", parser(XMLParser, foundCharacters:) is getting called once w/ the value of "Stilles " and subsequently with "Örtchen". When trimValueWhitespaces is set to true, the XMLParser delegate incorrectly strips out the space between s and Ö.

Looking at your approach for a fix, I think applying trimValueWhitespaces to the value of the element after the entire element's value has been parsed(parser(XMLParser, didEndElement:...)) is elegant, and seems to make sense. The only comment that I would have is maybe adding some test cases

Nested case:

private let deCharacterXMLNested = """
<si><t xml:space="preserve">\(deCharacterString)</t></si>
""".data(using: .utf8)!

...snip...

    func testNonStandardCharactersNested() throws {
        let decoder = XMLDecoder(trimValueWhitespaces: true)

        let resultGerman = try XMLDecoder().decode(Item.self, from: deCharacterXMLNested)

        XCTAssertEqual(resultGerman.text, deCharacterString)
    }

Spaces preserved:

private let deCharacterXMLSpaced = """
<t>     \(deCharacterString)    </t>
""".data(using: .utf8)!

...snip...

    func testNonStandardCharactersSpaced() throws {
        let decoder = XMLDecoder(trimValueWhitespaces: false)

        let resultGerman = try decoder.decode(String.self, from: deCharacterXMLSpaced)

        XCTAssertEqual(resultGerman, "     \(deCharacterString)    ")
    }

I'll defer to @MaxDesiatov for the final approval and/or any other comments.

@wooj2

for nested xml and untrimmed case thanks to @wooj2 for the suggestion!

MartinP7r · 2021-08-21T03:01:52Z

Sorry for the confusion @wooj2!
Thank you for having a look at my solution and especially for suggesting the additional tests.
I think I don't fully understand the XML specification for whitespace preservation, yet, so I highly appreciate it!
I went ahead and added the tests. I hope that's OK.

If I may, I have another slightly different question about whitespaces in XML.
(I can open a new issue if that makes more sense.)

In http://www.xmlplease.com/xml/xmlspace/ under 3. Whitespace only text nodes it says that something like

<root xml:space="preserve">
    <test> This is     great. </test>
</root>

would have the consecutive spaces between "is" and "great" (...) be reduced to just one space character. like in normal HTML parsing. This doesn't seem to be the case for the current implementation of XMLCoder.
I.e. a test as below

private let multipleSpacesBetween = "A      word"
private let multipleSpacesBetweenXML = """
<t>\(multipleSpacesBetween)</t>
""".data(using: .utf8)!

...snip...

    func testMultipleSpacesBetween() throws {
        let decoder = XMLDecoder(trimValueWhitespaces: true)

        let resultJapanese = try decoder.decode(String.self, from: multipleSpacesBetweenXML)

        XCTAssertEqual(resultJapanese, "A word")
    }

produces

XCTAssertEqual failed: ("A      word") is not equal to ("A word")

I'm sorry if I've missed something, but could you tell me if that is the way it should be or might this warrant a fix?

Best regards,
Martin

MaxDesiatov · 2021-08-21T20:47:28Z

Sorry, I'm a bit confused about XCTAssertEqual failed: ("A word") is not equal to ("A word"). Does this make any sense to you? These strings seem to be equal to each other at a first glance. Am I missing something? Is the comparison flawed, or is there some other whitespace character there that makes these strings not equal to each other?

MartinP7r · 2021-08-22T01:36:18Z

I'm sorry! I did not realize that GitHub's markdown parser processes whitespace differently for code in single-backticks than for three-backticks. So, the single-backtick version reduces consecutive whitespaces into one, just like normal HTML (which is ironically just what I was asking about and the documentation I linked is about).

I edited it to use three-backticks, so the error message displays correctly now.

XCTAssertEqual failed: ("A      word") is not equal to ("A word")

edit:
As I said, this question is probably unrelated to the current PR.
I just wanted to ask whether or not this is working as intended or if it should be as described in the linked documentation. I'd be happy to open a separate issue or PR with the failing tests above for this if you think it's relevant.

MaxDesiatov · 2021-08-22T11:13:38Z

Stuff like this related to the actual low level parsing of XML is a question about Foundation's XMLParser behavior. We only provide helpers on top of it.

I wonder how with this caveat in the standard one is supposed to encode or decode any strings with multiple spaces in them? The fact that multiple whitespaces aren't currently collapsed into one is either a happy accident or a deliberate decision made by Foundation authors. Either way, if that's not directly related to this PR, it does make sense to open a separate issue or PR for that.

MartinP7r · 2021-08-22T11:36:14Z

That makes a lot of sense. Thank you for the explanation!🙂
Since it's the underlying parser's behavior this is working as intended.
Thank you for taking the time to answer this and sorry for the excursion.

MaxDesiatov · 2021-08-22T11:50:38Z

No problem at all, thanks for digging into all of these details! Very illuminating and super helpful 🙂

MaxDesiatov · 2021-08-22T11:52:09Z

BTW, is there still anything to do within the scope of this PR? Or is it fully ready for review now?

MartinP7r · 2021-08-22T12:12:39Z

From my perspective it's complete.
Please let me know if you think there's something missing.

MaxDesiatov

LGTM, many thanks!

mflint · 2021-11-05T21:07:06Z

Thank-you for raising this issue, and fixing it, @MartinP7r.

I found that this was affecting text containing &. For example, "one & two" would be parsed into the string one&two. I love investigating a problem, and then finding that someone else has already fixed it! ❤️

MaxDesiatov · 2021-11-05T21:22:54Z

Thanks for letting me know this is a widespread problem. I'll tag 0.13.1 with this fix soon.

MartinP7r · 2021-11-07T02:48:03Z

@mflint, thank you for your kind words! It's very rewarding to hear that this is helpful to others 😃

Also, thank you for bringing up your use case. It might not be entirely necessary, but since XML entities like &, < or > are somewhat special, I opened a PR (#234) adding another test.

MartinP7r force-pushed the fix/nonStandardCharacter_SpacePreservation branch from 5ab0d61 to 7c693a0 Compare August 14, 2021 08:01

trim only completed string elements

01f4cd5

Refactor `process()` function name to `trimWhitespacesIfNeeded()`

Add additional tests for non-standard characters

48393da

for nested xml and untrimmed case thanks to @wooj2 for the suggestion!

MartinP7r marked this pull request as ready for review August 22, 2021 01:53

MaxDesiatov changed the title ~~[WIP] fix trimValueWhitespaces removing needed white-spaces~~ Fix trimValueWhitespaces removing needed white-spaces Aug 22, 2021

MaxDesiatov approved these changes Aug 22, 2021

View reviewed changes

MaxDesiatov merged commit 5c5b3f2 into CoreOffice:main Aug 22, 2021

MartinP7r mentioned this pull request Nov 7, 2021

Add test for preserved spaces with xml entities #234

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `trimValueWhitespaces` removing needed white-spaces #226

Fix `trimValueWhitespaces` removing needed white-spaces #226

MartinP7r commented Aug 14, 2021 •

edited by MaxDesiatov

MaxDesiatov commented Aug 14, 2021

MartinP7r commented Aug 16, 2021

MartinP7r commented Aug 20, 2021 •

edited

codecov bot commented Aug 20, 2021 •

edited

wooj2 commented Aug 20, 2021 •

edited

MartinP7r commented Aug 21, 2021 •

edited

MaxDesiatov commented Aug 21, 2021

MartinP7r commented Aug 22, 2021 •

edited

MaxDesiatov commented Aug 22, 2021

MartinP7r commented Aug 22, 2021

MaxDesiatov commented Aug 22, 2021

MaxDesiatov commented Aug 22, 2021

MartinP7r commented Aug 22, 2021

MaxDesiatov left a comment

mflint commented Nov 5, 2021 •

edited

MaxDesiatov commented Nov 5, 2021

MartinP7r commented Nov 7, 2021 •

edited

Fix trimValueWhitespaces removing needed white-spaces #226

Fix trimValueWhitespaces removing needed white-spaces #226

Conversation

MartinP7r commented Aug 14, 2021 • edited by MaxDesiatov

MaxDesiatov commented Aug 14, 2021

MartinP7r commented Aug 16, 2021

MartinP7r commented Aug 20, 2021 • edited

codecov bot commented Aug 20, 2021 • edited

Codecov Report

wooj2 commented Aug 20, 2021 • edited

MartinP7r commented Aug 21, 2021 • edited

MaxDesiatov commented Aug 21, 2021

MartinP7r commented Aug 22, 2021 • edited

MaxDesiatov commented Aug 22, 2021

MartinP7r commented Aug 22, 2021

MaxDesiatov commented Aug 22, 2021

MaxDesiatov commented Aug 22, 2021

MartinP7r commented Aug 22, 2021

MaxDesiatov left a comment

Choose a reason for hiding this comment

mflint commented Nov 5, 2021 • edited

MaxDesiatov commented Nov 5, 2021

MartinP7r commented Nov 7, 2021 • edited

Fix `trimValueWhitespaces` removing needed white-spaces #226

Fix `trimValueWhitespaces` removing needed white-spaces #226

MartinP7r commented Aug 14, 2021 •

edited by MaxDesiatov

MartinP7r commented Aug 20, 2021 •

edited

codecov bot commented Aug 20, 2021 •

edited

wooj2 commented Aug 20, 2021 •

edited

MartinP7r commented Aug 21, 2021 •

edited

MartinP7r commented Aug 22, 2021 •

edited

mflint commented Nov 5, 2021 •

edited

MartinP7r commented Nov 7, 2021 •

edited