Rework markdown parser to fix #4613 #4677

codesoap · 2024-02-26T10:22:45Z

Description:

First of all: Sorry for the huge diff, without any previous conversation. I had an idea for a rework, but before suggesting it, I wanted to see if the idea would even work. Before I knew it, I fell into a rabbit hole and now here I am with this PR.

Inspired by #4613, I read through markdown.go and investigated goldmark a little bit (here's a tool to visualize goldmark's ASTs: goldmarkvis). I got an understanding of where spaces should be added, but couldn't find an easy way to incorporate my newfound knowledge into the existing design. I felt like it would be easier, if the AST were rendered with recursive code.

I went ahead and tried it out. With this approach, I was able to fix all the problems mentioned in #4613 (including the "split link"). As a side effect, the code also got a little more compact and slightly more readable (but that's highly subjective, of course).
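
To illustrate the recursive approach described above, here is a minimal sketch. It assumes a toy AST and a toy segment type (`node`, `text`, `emphasis`, `strong`, `paragraph`, and `segment` are all hypothetical names). It does not use goldmark or Fyne and is not the code from this PR. It only shows how a type switch plus recursion can pass the style down the call stack instead of tracking walker state.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a parsed markdown AST node.
type node interface{}

// Hypothetical node types; real goldmark nodes look different.
type (
	text      struct{ value string }
	emphasis  struct{ children []node }
	strong    struct{ children []node }
	paragraph struct{ children []node }
)

// segment is a hypothetical stand-in for a rich text segment.
type segment struct {
	style string
	text  string
}

// render converts a node and its descendants into segments. The style is
// handed down as an argument, so no state has to be kept between nodes.
func render(n node, style string) []segment {
	switch t := n.(type) {
	case text:
		return []segment{{style: style, text: t.value}}
	case emphasis:
		return renderAll(t.children, "emphasis")
	case strong:
		return renderAll(t.children, "strong")
	case paragraph:
		return renderAll(t.children, "inline")
	}
	return nil // unknown node types produce no segments in this sketch
}

// renderAll renders each child with the given style and concatenates the results.
func renderAll(children []node, style string) []segment {
	var out []segment
	for _, c := range children {
		out = append(out, render(c, style)...)
	}
	return out
}

func main() {
	doc := paragraph{children: []node{
		emphasis{children: []node{text{value: "foo"}}},
		text{value: "bar"},
	}}
	for _, s := range render(doc, "inline") {
		fmt.Printf("%s: %q\n", s.style, s.text)
	}
}
```

Supporting a new node type in this sketch means adding one `case` to the switch, which is the kind of compactness the description refers to.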

I have also added a benchmark, which shows a slight decline in performance with the new implementation. It's not too bad, and I would assume that performance would have decreased anyway, no matter how #4613 had been fixed:

Before: BenchmarkMarkdownParsing-4    8527    135023 ns/op    34864 B/op    306 allocs/op
After:  BenchmarkMarkdownParsing-4    7798    149466 ns/op    35920 B/op    328 allocs/op
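
For readers who want the relative change rather than the raw numbers, the two benchmark lines above can be compared with a few lines of Go. The helper `pctIncrease` is a hypothetical name introduced here. The inputs are copied from the benchmark output above.

```go
package main

import "fmt"

// pctIncrease returns the relative increase from before to after, in percent.
func pctIncrease(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Values taken from the Before/After benchmark lines.
	fmt.Printf("time per op:   +%.1f%%\n", pctIncrease(135023, 149466))
	fmt.Printf("bytes per op:  +%.1f%%\n", pctIncrease(34864, 35920))
	fmt.Printf("allocs per op: +%.1f%%\n", pctIncrease(306, 328))
}
```

This works out to roughly 10.7% more time, 3.0% more bytes, and 7.2% more allocations per operation.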

Since the new implementation adds leading spaces instead of trailing ones, and also adds new segments for some spaces and the end of paragraphs, I had to adapt the existing tests slightly.

Fixes #4613

Checklist:

Tests included.
Lint and formatter run with no errors.
Tests all pass.

The new version simplifies the code by using recursion instead of walking the ast while keeping track of state. It also improves the code by using a type switch. Fixes fyne-io#4613

coveralls · 2024-02-26T10:27:48Z

coverage: 65.013% (+0.2%) from 64.783%
when pulling 471f1cf on codesoap:markdown
into 6b7246b on fyne-io:develop.

andydotxyz · 2024-02-26T18:01:25Z

> Since the new implementation adds leading spaces instead of trailing ones, and also adds new segments for some spaces and the end of paragraphs, I had to adapt the existing tests slightly.

Moving where spaces are inserted seems OK, so the test changes where a space is moved from one segment to another seem OK.
However, adding trailing spaces to handle the newline doesn't seem right, because these spaces will be inserted into the content and the String() method will no longer be correct.

Unless I misunderstood what you meant?

codesoap · 2024-02-26T21:45:35Z

Sorry, I didn't explain it enough. String() should not be affected. No superfluous spaces are inserted. Besides the swap from trailing to leading spaces, there are these two new special segments:

1. goldmark seems to signal a single newline after emph/strong/link with an ast.Text node, which has no content (""). This is translated to a TextSegment with the Text: " ". We need to play along with this quirk of goldmark, because (as far as I can see) this is the only way we can distinguish *foo*\nbar from *foo*bar.
2. Previously the last segment of a paragraph got modified with text.Style.Inline = false. As far as I can see, this won't work if the last element of a paragraph is, for example, a HyperlinkSegment, because this doesn't have a Style. Thus I now always add a TextSegment with Style: RichTextStyleParagraph and no Text at the end of paragraphs.

An example will probably explain it better:

*foo*
bar

Will now be parsed to the following 4 TextSegments:

1. A segment with RichTextStyleEmphasis and Text: "foo".
2. A segment with RichTextStyleInline and Text: " " (result of 1. from above).
3. A segment with RichTextStyleInline and Text: "bar".
4. A segment with RichTextStyleParagraph and Text: "" (result of 2. from above).
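
The four segments listed above can be reproduced with a small sketch. It assumes toy types (`inline`, `seg`, and `paragraphSegments` are hypothetical names, not Fyne's or goldmark's API) and hard-codes the node sequence that the comment says goldmark produces for `*foo*\nbar`: an emphasis node, an empty text node, and a text node.

```go
package main

import "fmt"

// inline is a hypothetical stand-in for an inline AST node.
type inline struct {
	kind string // "emph" or "text"
	text string
}

// seg is a hypothetical stand-in for a TextSegment.
type seg struct {
	style string
	text  string
}

// paragraphSegments applies the two rules from the comment above:
// an empty text node becomes a single-space segment, and every
// paragraph ends with an empty segment in paragraph style.
func paragraphSegments(nodes []inline) []seg {
	var out []seg
	for _, n := range nodes {
		switch {
		case n.kind == "emph":
			out = append(out, seg{style: "emphasis", text: n.text})
		case n.text == "":
			// Rule 1: empty text node marks a newline, rendered as " ".
			out = append(out, seg{style: "inline", text: " "})
		default:
			out = append(out, seg{style: "inline", text: n.text})
		}
	}
	// Rule 2: always terminate the paragraph with an empty segment.
	return append(out, seg{style: "paragraph", text: ""})
}

func main() {
	// Assumed node sequence for the markdown "*foo*\nbar".
	nodes := []inline{
		{kind: "emph", text: "foo"},
		{kind: "text", text: ""},
		{kind: "text", text: "bar"},
	}
	for i, s := range paragraphSegments(nodes) {
		fmt.Printf("%d. %s %q\n", i+1, s.style, s.text)
	}
}
```

Because the terminator carries no text, concatenating the segment texts in this sketch yields "foo bar" with nothing extra appended.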

andydotxyz · 2024-02-29T09:09:59Z

I think I follow.
To confirm this, could you check in the tests that the newly inserted text segments for line break are indeed Text with "" and not " ", just to be sure? Or at least some of them, just so we can't regress this special marker to something else in the future.

codesoap · 2024-02-29T11:01:34Z

I have added an explicit test for the last segment of a paragraph. Is this OK? I can also adapt the existing test functions, if you'd prefer that.

montovaneli · 2024-04-05T18:55:56Z

Sorry to jump into the discussion, but I'm really interested in this pull request.

I have a case using a list + new paragraph that doesn't seem right.
Testing on https://markdownlivepreview.com/

Now, using this pull request and the string s:= "- **List Item 1**: 1234\n\nNew paragraph\n\nAnother new paragraph":

Sorry if I misunderstood something, but it seems that the "New paragraph" is not a new paragraph at all.

andydotxyz · 2024-04-05T20:05:28Z

> Sorry if I misunderstood something, but it seems that the "New paragraph" is not a new paragraph at all.

To know this, you should look at the object tree of Segments. I suspect it is an unrelated bug where the gap between a block element and a paragraph is not large enough. Try testing with and without this PR's patch applied to see whether it changes anything. I think they are unrelated.

andydotxyz · 2024-04-22T10:19:50Z

widget/markdown_test.go

 		assert.True(t, text.Inline())
 	} else {
 		t.Error("Segment should be Text")
 	}
 	if text, ok := r.Segments[1].(*TextSegment); ok {
-		assert.Equal(t, "line2", text.Text)
+		assert.Equal(t, " line2", text.Text)


Just coming back to this, apologies for the delay. Is this a desirable change? Putting a space at the start of a string seems like it could cause a problem.

codesoap · 2024-04-28

Previously trailing spaces were inserted, now leading spaces are inserted instead. I cannot remember why I changed it, but I guess it made the implementation a little easier. I don't see how either choice would be better than the other. What kind of problem do you foresee?

andydotxyz · 2024-04-29

The problem I foresee is that strings are indexed from the beginning, so if anyone was actually working with the content of a TextSegment, the characters inside it are now off by one.

codesoap · 2024-04-29

Good point. I'll try to revert to trailing spaces when I find some time.

codesoap · 2024-05-05

I have now revisited the code and was able to return to trailing spaces instead of leading ones. In the process, the code even became a little more compact and the renderNode function requires one less argument.

I hope I didn't forget anything that I had considered when originally creating this pull request, but the tests cannot find any issues.
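
The off-by-one concern from this thread can be shown with plain strings. This is a sketch with hypothetical segment contents ("line1" and "line2", with the separating space placed either at the end of the first or at the start of the second). The helpers `joined` and `offsetOf` are names introduced here for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// joined concatenates the segment texts, as a reader of the whole content would see them.
func joined(segs []string) string {
	return strings.Join(segs, "")
}

// offsetOf reports the byte index at which word starts inside a segment's text.
func offsetOf(segment, word string) int {
	return strings.Index(segment, word)
}

func main() {
	trailing := []string{"line1 ", "line2"} // space kept at the end of the first segment
	leading := []string{"line1", " line2"}  // space moved to the start of the second segment

	// The concatenated content is identical either way.
	fmt.Println(joined(trailing) == joined(leading))

	// Indexing into the second segment differs by one.
	fmt.Println(offsetOf(trailing[1], "line2"))
	fmt.Println(offsetOf(leading[1], "line2"))
}
```

Both variants produce the same overall text, but with leading spaces the word in the second segment starts at index 1 instead of 0. Code that indexes into a single segment's text would be affected by that shift, which is the reason given for returning to trailing spaces.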

andydotxyz

Thanks so much for working this out

codesoap · 2024-05-07T16:22:39Z

Thanks for accepting this PR! I kinda feared it would be rejected because it changes too much and really appreciate you taking the time to consider it.

andydotxyz · 2024-05-08T08:36:08Z

The benefit of a robust unit test suite is that the size of the change isn't usually a problem. It just takes longer.

I always go for the smallest possible fix, but in some cases refactoring is the way forward :).

codesoap added 4 commits February 25, 2024 22:12

Add benchmark for parsing markdown

8defc34

Test newline handling in markdown

d3f858f

Rework markdown parser

dc3334e

The new version simplifies the code by using recursion instead of walking the ast while keeping track of state. It also improves the code by using a type switch. Fixes fyne-io#4613

Adapt tests to new markdown parser

9d89d24

Ensure that last segment of markdown paragraph is empty

aa8bb3a

andydotxyz reviewed Apr 22, 2024

Return to trailing spaces in the markdown parser

471f1cf

andydotxyz approved these changes May 7, 2024

andydotxyz merged commit 54154f9 into fyne-io:develop May 7, 2024
12 checks passed
