Standard YAML frontmatter syntax causes lists to not be detected #5

nathanlesage · 2023-01-13T14:26:27Z

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark-extension-frontmatter v1.0.0

Link to runnable example

https://codesandbox.io/s/hopeful-matan-nybskz?file=%2Fsrc%2Findex.js

Steps to reproduce

I have a workload where I need to transform Markdown documents that can also include frontmatters to a syntax tree, using parse() on the remark processor.

It breaks when that said YAML frontmatter uses Jekyll-style delimiters (three dashes both on top and on the bottom of the frontmatter). It works, however, when I use Pandoc-style delimiters (three dashes on top, three dots on the bottom). Usage of the plugin then disables the processor's ability to detect lists properly.

Here are the steps to reproduce. Above I have linked a sandbox that is already properly set up. In there, simply exchange the end-fence for the frontmatter between dashes and dots and observe the difference in produced syntax trees.

Processor setup

export function md2ast (markdown: string): Root {
  return remark()
    .use(remarkFrontmatter, [
      // Either Pandoc-style frontmatters ...
      { type: 'yaml', fence: { open: '---', close: '...' } },
      // ... or Jekyll/Static site generators-style frontmatters.
      { type: 'yaml', fence: { open: '---', close: '---' } }
    ])
    .use(remarkMath)
    .parse(markdown)
}

Note that while I am being verbose in explicitly stating the delimiters, exchanging the second definition to simply yaml does not change the effect.

Test document

Use this document and run it through said pipeline. You will notice that, when you use three dashes to end the frontmatter, the list is not correctly detected, whereas, when you exchange that with three dots (Pandoc-style frontmatter), it will be correctly detected.

---
title: "The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques"
date: 2022-12-05
id: 20221205151411
author: Hendrik Erz
---

# Export Link Removal

Export this file into any format to test out the corresponding LUA filter.

* History of NLP in Sociology (from Mosteller and Wallace to today)
* What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)
* What are they being used for?
    * This is a worngly written second-indended word
* Where do these methods come from?
* Are there already ethical notes around? Or don’t they care?

Expected behavior

Correct behavior with Pandoc-style frontmatter

If you exchange the three dashes with three dots in the test-document, it works as expected.

This is the correct Syntax Tree:

{
  "type": "root",
  "children": [
    {
      "type": "yaml",
      "value": "title: \"The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques\"\ndate: 2022-12-05\nid: 20221205151411\nauthor: Hendrik Erz",
      "position": {
        "start": { "line": 1, "column": 1, "offset": 0 },
        "end": { "line": 6, "column": 4, "offset": 160 }
      }
    },
    {
      "type": "heading",
      "depth": 1,
      "children": [
        {
          "type": "text",
          "value": "Export Link Removal",
          "position": {
            "start": { "line": 8, "column": 3, "offset": 164 },
            "end": { "line": 8, "column": 22, "offset": 183 }
          }
        }
      ],
      "position": {
        "start": { "line": 8, "column": 1, "offset": 162 },
        "end": { "line": 8, "column": 22, "offset": 183 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Export this file into any format to test out the corresponding LUA filter.",
          "position": {
            "start": { "line": 10, "column": 1, "offset": 185 },
            "end": { "line": 10, "column": 75, "offset": 259 }
          }
        }
      ],
      "position": {
        "start": { "line": 10, "column": 1, "offset": 185 },
        "end": { "line": 10, "column": 75, "offset": 259 }
      }
    },
    {
      "type": "list",
      "ordered": false,
      "start": null,
      "spread": false,
      "children": [
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "History of NLP in Sociology (from Mosteller and Wallace to today)",
                  "position": {
                    "start": { "line": 12, "column": 3, "offset": 263 },
                    "end": { "line": 12, "column": 68, "offset": 328 }
                  }
                }
              ],
              "position": {
                "start": { "line": 12, "column": 3, "offset": 263 },
                "end": { "line": 12, "column": 68, "offset": 328 }
              }
            }
          ],
          "position": {
            "start": { "line": 12, "column": 1, "offset": 261 },
            "end": { "line": 12, "column": 68, "offset": 328 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)",
                  "position": {
                    "start": { "line": 13, "column": 3, "offset": 331 },
                    "end": { "line": 13, "column": 161, "offset": 489 }
                  }
                }
              ],
              "position": {
                "start": { "line": 13, "column": 3, "offset": 331 },
                "end": { "line": 13, "column": 161, "offset": 489 }
              }
            }
          ],
          "position": {
            "start": { "line": 13, "column": 1, "offset": 329 },
            "end": { "line": 13, "column": 161, "offset": 489 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "What are they being used for?",
                  "position": {
                    "start": { "line": 14, "column": 3, "offset": 492 },
                    "end": { "line": 14, "column": 32, "offset": 521 }
                  }
                }
              ],
              "position": {
                "start": { "line": 14, "column": 3, "offset": 492 },
                "end": { "line": 14, "column": 32, "offset": 521 }
              }
            },
            {
              "type": "list",
              "ordered": false,
              "start": null,
              "spread": false,
              "children": [
                {
                  "type": "listItem",
                  "spread": false,
                  "checked": null,
                  "children": [
                    {
                      "type": "paragraph",
                      "children": [
                        {
                          "type": "text",
                          "value": "This is a worngly written second-indended word",
                          "position": {
                            "start": { "line": 15, "column": 7, "offset": 528 },
                            "end": { "line": 15, "column": 53, "offset": 574 }
                          }
                        }
                      ],
                      "position": {
                        "start": { "line": 15, "column": 7, "offset": 528 },
                        "end": { "line": 15, "column": 53, "offset": 574 }
                      }
                    }
                  ],
                  "position": {
                    "start": { "line": 15, "column": 5, "offset": 526 },
                    "end": { "line": 15, "column": 53, "offset": 574 }
                  }
                }
              ],
              "position": {
                "start": { "line": 15, "column": 5, "offset": 526 },
                "end": { "line": 15, "column": 53, "offset": 574 }
              }
            }
          ],
          "position": {
            "start": { "line": 14, "column": 1, "offset": 490 },
            "end": { "line": 15, "column": 53, "offset": 574 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "Where do these methods come from?",
                  "position": {
                    "start": { "line": 16, "column": 3, "offset": 577 },
                    "end": { "line": 16, "column": 36, "offset": 610 }
                  }
                }
              ],
              "position": {
                "start": { "line": 16, "column": 3, "offset": 577 },
                "end": { "line": 16, "column": 36, "offset": 610 }
              }
            }
          ],
          "position": {
            "start": { "line": 16, "column": 1, "offset": 575 },
            "end": { "line": 16, "column": 36, "offset": 610 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "Are there already ethical notes around? Or don’t they care?",
                  "position": {
                    "start": { "line": 17, "column": 3, "offset": 613 },
                    "end": { "line": 17, "column": 62, "offset": 672 }
                  }
                }
              ],
              "position": {
                "start": { "line": 17, "column": 3, "offset": 613 },
                "end": { "line": 17, "column": 62, "offset": 672 }
              }
            }
          ],
          "position": {
            "start": { "line": 17, "column": 1, "offset": 611 },
            "end": { "line": 17, "column": 62, "offset": 672 }
          }
        }
      ],
      "position": {
        "start": { "line": 12, "column": 1, "offset": 261 },
        "end": { "line": 17, "column": 62, "offset": 672 }
      }
    }
  ],
  "position": {
    "start": { "line": 1, "column": 1, "offset": 0 },
    "end": { "line": 18, "column": 1, "offset": 673 }
  }
}

Actual behavior

Incorrect behavior with Jekyll-style frontmatter

If you just run the above test document (with three dashes), it produces a wrong syntax tree. Observe how it does not detect the list properly.

{
  "type": "root",
  "children": [
    {
      "type": "yaml",
      "value": "title: \"The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques\"\ndate: 2022-12-05\nid: 20221205151411\nauthor: Hendrik Erz",
      "position": {
        "start": { "line": 1, "column": 1, "offset": 0 },
        "end": { "line": 6, "column": 4, "offset": 160 }
      }
    },
    {
      "type": "heading",
      "depth": 1,
      "children": [
        {
          "type": "text",
          "value": "Export Link Removal",
          "position": {
            "start": { "line": 8, "column": 3, "offset": 164 },
            "end": { "line": 8, "column": 22, "offset": 183 }
          }
        }
      ],
      "position": {
        "start": { "line": 8, "column": 1, "offset": 162 },
        "end": { "line": 8, "column": 22, "offset": 183 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Export this file into any format to test out the corresponding LUA filter.",
          "position": {
            "start": { "line": 10, "column": 1, "offset": 185 },
            "end": { "line": 10, "column": 75, "offset": 259 }
          }
        }
      ],
      "position": {
        "start": { "line": 10, "column": 1, "offset": 185 },
        "end": { "line": 10, "column": 75, "offset": 259 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "* History of NLP in Sociology (from Mosteller and Wallace to today)\n* What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)\n* What are they being used for?\n* This is a worngly written second-indended word\n* Where do these methods come from?\n* Are there already ethical notes around? Or don’t they care?",
          "position": {
            "start": { "line": 12, "column": 1, "offset": 261 },
            "end": { "line": 17, "column": 62, "offset": 672 }
          }
        }
      ],
      "position": {
        "start": { "line": 12, "column": 1, "offset": 261 },
        "end": { "line": 17, "column": 62, "offset": 672 }
      }
    }
  ],
  "position": {
    "start": { "line": 1, "column": 1, "offset": 0 },
    "end": { "line": 18, "column": 1, "offset": 673 }
  }
}

Runtime

Node v16, Other (please specify in steps to reproduce)

Package manager

yarn v1

OS

macOS

Build and bundle tools

Webpack

The text was updated successfully, but these errors were encountered:

wooorm · 2023-01-13T14:40:46Z

So first, for the code sandbox:

Frontmatter starts at the first line. You can test this on GH: adding blank lines before the YAML doesn’t work. Start with:

const sourceMarkdown = `---

…and it works.

wooorm · 2023-01-13T14:44:48Z

It breaks when that said YAML frontmatter uses Jekyll-style delimiters (three dashes both on top and on the bottom of the frontmatter)

What doesn’t work? I believe it works. That’s the default. That’s what 1.6m people per month use the plugin for. And it’s well tested

wooorm · 2023-01-13T14:50:21Z

Here’s a reduced test case:

import {fromMarkdown} from 'mdast-util-from-markdown'
import {frontmatter} from 'micromark-extension-frontmatter'
import {frontmatterFromMarkdown} from 'mdast-util-frontmatter'

const sourceMarkdown = `---
title: a

* b
`;

const tree = fromMarkdown(sourceMarkdown, {
  extensions: [frontmatter()],
  mdastExtensions: [frontmatterFromMarkdown()]
})

console.log(tree) // Contains a paragraph with `* b` as text, not a list.

The reason is that this looks a lot like frontmatter. So it continues on checking all those lines for the ending. When doing that, it prevents the lists from being checked. Afterwards, it can’t do those lists again.

github-actions · 2023-01-13T14:51:54Z

Hi! This was marked as ready to be worked on! Note that while this is ready to be worked on, nothing is said about priority: it may take a while for this to be solved.

Is this something you can and want to work on?

Team: please use the area/* (to describe the scope of the change), platform/* (if this is related to a specific one), and semver/* and type/* labels to annotate this. If this is first-timers friendly, add good first issue and if this could use help, add help wanted.

nathanlesage · 2023-01-13T15:29:29Z

Thank you for your swift reply! You are right that I accidentally added a newline there. However, when I remove that, the list is still not properly being detected. Here's a screenshot from the updated sandbox:

What doesn’t work? I believe it works. That’s the default. That’s what 1.6m people per month use the plugin for. And it’s well tested

I am sorry if I came across as ungrateful or anything — I am a big fan of your work and appreciate everything you do. I did some extensive checking this morning to find the bug, and ended up finding this oddity. I am, however, not sure what kind of interaction effect is (I believe that each plugin itself works fine, absolutely, but since we have this weird bug, I suspect some interaction – that's also why I'm not sure if this emanates from here, or from the remark plugin that wraps this one…? I am thankful for any guidance in this regard.)

Regarding your reduced case, what is the intention in leaving out the ending fence of the frontmatter here? In other words, why did you not write the following?

---
title: a
---

* b

One additional thing I noted, but which I believe can be fixed when this bug is sorted out, is that if the list is not being properly detected, the parser "eats" the indentation spaces in front of the second-order list item. You may note that there is this worngly spelled word in the indentation. The reason is that I'm using the syntax tree to run a spell checker, and I noted that the spaces in front of the list just get swallowed, shifting the offsets of the text nodes into a wrong position by that amount. I did not initially report that because I believe it is just a fallout effect of whatever this is.

I greatly appreciate any guidance in this regard. Thank you!

wooorm · 2023-01-13T15:44:21Z

am sorry if I came across as ungrateful or anything

No need, no worries, you didn’t come across as such. See my later response for a reduced case. You are correct, there is a bug.

Regarding your reduced case, what is the intention in leaving out the ending fence of the frontmatter here? In other words, why did you not write the following?

Because that’s exactly what’s going on here: if an opening but no closing fence is found, lists don’t work. You had --- and ... in your first case. Which, when starting with --- but without an ending of ..., shows that exact bug. But your second case (start and end ---) masked what was happening. Because some frontmatter was parsed.

For the second bug, whitespace being eaten, I think it’s unrelated to this, probably. It’s because whitespace in paragraphs is removed. E.g.:

asd
                                   whatever

That positional info might be lost somewhere though. As the text node is asd\nwhatever for that. But also this markdown would result in the exact same text node:

> * asd
> whatever

Maybe some positional info is lost somewhere

nathanlesage · 2023-01-13T15:57:27Z

Ahhhh, I see what you mean, yes! Thank you for the clarification!

nathanlesage · 2023-02-08T08:07:31Z

Hey @wooorm, just as an update: I have now figured out that the fault in the options also prevents the Gfm parser (remark-gfm) from detecting footnotes (possibly other elements as well). It works fine if I just comment out the corresponding config option { type: 'yaml', fence: { open: '---', close: '...' } }.

wooorm · 2023-02-08T08:59:37Z

Yep, block quotes, lists, and footnote definitions! All “containers” that fail on this :(
Will be maybe a bit less than 2 months before I can get to it unfortunately.

nathanlesage · 2023-02-08T09:15:34Z

Hey, thanks for your quick reply! I can absolutely understand if you can't spend time on it right now. I would be happy to prepare a PR, since I think I found the cause, but it would be ideal if you could point me to a few resources so that I can make the fix work properly and stick to the unified conventions.

Here's what I found:

Because both frontmatters (The Pandoc style – --- / ... – and the jekyll-style – --- / ---) start with the same character code (45), the tokenizer layout looks like this:

{
    flow: {
      '45': [
        tokenizer, // Parses --- / ...
        tokenizer // Parses --- / ---
      ]
    }
}

This makes the engine go through the document twice, once for the --- / ... frontmatter and then for the --- / ---frontmatter type. The first cannot find the ... ending fence and hence parses the full document as a frontmatter value, thereby disturbing the other parsers.

There are two possible solutions:

If the engine allows one to abort parsing anything if the given frontmatter type is not detected, then we could add a break condition so that the parser that doesn't fit does not attempt to detect anything.

The other solution would be to merge the openings and closings, i.e., allow for an array of strings in the open/close options that configure the plugin. I've seen that the buffer variable is being used to detect a frontmatter fence, I could rewrite the code so that it accepts an array and just tests for every possibility. Although, since the plugins themselves actually work perfectly fine, if I just add a single frontmatter definition, it might be much cleaner to add the aforementioned "break condition".

What are your thoughts? I can get a PR ready for review today so that it's fixed!

wooorm · 2023-02-08T09:43:48Z

See my small reproduction above: #5 (comment). This issue exists for normal frontmatter. A single format. ---\n* a.

I think it’s going to get super complex: the engine does not really support aborting. Because normal markdown doesn’t have that: once there’s a line that opens fenced code, anything after it is fenced code, for example.
And you can’t have a list item in those.
Same here. The YAML is open. You can’t have a list item in it.
But it never closed. And we don’t have list items.
So micromark needs an architectural overhaul to solve.

nathanlesage · 2023-02-08T10:36:44Z

Ouh … alright. This explains why I wasn't getting anywhere with my attempts at fixing in this regard … :/

github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Jan 13, 2023

wooorm added 🐛 type/bug This is a problem 👍 phase/yes Post is accepted and can be worked on labels Jan 13, 2023

github-actions bot removed the 🤞 phase/open Post is being triaged manually label Jan 13, 2023

nathanlesage mentioned this issue May 4, 2023

@khanacademy/simple-markdown imposes peer dependency on an old version of react mermaid-js/mermaid#4302

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standard YAML frontmatter syntax causes lists to not be detected #5

Standard YAML frontmatter syntax causes lists to not be detected #5

nathanlesage commented Jan 13, 2023

wooorm commented Jan 13, 2023 •

edited

wooorm commented Jan 13, 2023

wooorm commented Jan 13, 2023 •

edited

github-actions bot commented Jan 13, 2023

nathanlesage commented Jan 13, 2023

wooorm commented Jan 13, 2023

nathanlesage commented Jan 13, 2023

nathanlesage commented Feb 8, 2023

wooorm commented Feb 8, 2023

nathanlesage commented Feb 8, 2023

wooorm commented Feb 8, 2023

nathanlesage commented Feb 8, 2023

Standard YAML frontmatter syntax causes lists to not be detected #5

Standard YAML frontmatter syntax causes lists to not be detected #5

Comments

nathanlesage commented Jan 13, 2023

Initial checklist

Affected packages and versions

Link to runnable example

Steps to reproduce

Processor setup

Test document

Expected behavior

Correct behavior with Pandoc-style frontmatter

Actual behavior

Incorrect behavior with Jekyll-style frontmatter

Runtime

Package manager

OS

Build and bundle tools

wooorm commented Jan 13, 2023 • edited

wooorm commented Jan 13, 2023

wooorm commented Jan 13, 2023 • edited

github-actions bot commented Jan 13, 2023

nathanlesage commented Jan 13, 2023

wooorm commented Jan 13, 2023

nathanlesage commented Jan 13, 2023

nathanlesage commented Feb 8, 2023

wooorm commented Feb 8, 2023

nathanlesage commented Feb 8, 2023

wooorm commented Feb 8, 2023

nathanlesage commented Feb 8, 2023

wooorm commented Jan 13, 2023 •

edited

wooorm commented Jan 13, 2023 •

edited