Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standard YAML frontmatter syntax causes lists to not be detected #5

Open
4 tasks done
nathanlesage opened this issue Jan 13, 2023 · 12 comments
Open
4 tasks done
Labels
👍 phase/yes Post is accepted and can be worked on 🐛 type/bug This is a problem

Comments

@nathanlesage
Copy link

Initial checklist

Affected packages and versions

micromark-extension-frontmatter v1.0.0

Link to runnable example

https://codesandbox.io/s/hopeful-matan-nybskz?file=%2Fsrc%2Findex.js

Steps to reproduce

I have a workload where I need to transform Markdown documents that can also include frontmatters to a syntax tree, using parse() on the remark processor.

It breaks when that said YAML frontmatter uses Jekyll-style delimiters (three dashes both on top and on the bottom of the frontmatter). It works, however, when I use Pandoc-style delimiters (three dashes on top, three dots on the bottom). Usage of the plugin then disables the processor's ability to detect lists properly.

Here are the steps to reproduce. Above I have linked a sandbox that is already properly set up. In there, simply exchange the end-fence for the frontmatter between dashes and dots and observe the difference in produced syntax trees.

Processor setup

export function md2ast (markdown: string): Root {
  return remark()
    .use(remarkFrontmatter, [
      // Either Pandoc-style frontmatters ...
      { type: 'yaml', fence: { open: '---', close: '...' } },
      // ... or Jekyll/Static site generators-style frontmatters.
      { type: 'yaml', fence: { open: '---', close: '---' } }
    ])
    .use(remarkMath)
    .parse(markdown)
}

Note that while I am being verbose in explicitly stating the delimiters, exchanging the second definition to simply yaml does not change the effect.

Test document

Use this document and run it through said pipeline. You will notice that, when you use three dashes to end the frontmatter, the list is not correctly detected, whereas, when you exchange that with three dots (Pandoc-style frontmatter), it will be correctly detected.

---
title: "The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques"
date: 2022-12-05
id: 20221205151411
author: Hendrik Erz
---

# Export Link Removal

Export this file into any format to test out the corresponding LUA filter.

* History of NLP in Sociology (from Mosteller and Wallace to today)
* What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)
* What are they being used for?
    * This is a worngly written second-indended word
* Where do these methods come from?
* Are there already ethical notes around? Or don’t they care?

Expected behavior

Correct behavior with Pandoc-style frontmatter

If you exchange the three dashes with three dots in the test-document, it works as expected.

This is the correct Syntax Tree:

{
  "type": "root",
  "children": [
    {
      "type": "yaml",
      "value": "title: \"The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques\"\ndate: 2022-12-05\nid: 20221205151411\nauthor: Hendrik Erz",
      "position": {
        "start": { "line": 1, "column": 1, "offset": 0 },
        "end": { "line": 6, "column": 4, "offset": 160 }
      }
    },
    {
      "type": "heading",
      "depth": 1,
      "children": [
        {
          "type": "text",
          "value": "Export Link Removal",
          "position": {
            "start": { "line": 8, "column": 3, "offset": 164 },
            "end": { "line": 8, "column": 22, "offset": 183 }
          }
        }
      ],
      "position": {
        "start": { "line": 8, "column": 1, "offset": 162 },
        "end": { "line": 8, "column": 22, "offset": 183 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Export this file into any format to test out the corresponding LUA filter.",
          "position": {
            "start": { "line": 10, "column": 1, "offset": 185 },
            "end": { "line": 10, "column": 75, "offset": 259 }
          }
        }
      ],
      "position": {
        "start": { "line": 10, "column": 1, "offset": 185 },
        "end": { "line": 10, "column": 75, "offset": 259 }
      }
    },
    {
      "type": "list",
      "ordered": false,
      "start": null,
      "spread": false,
      "children": [
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "History of NLP in Sociology (from Mosteller and Wallace to today)",
                  "position": {
                    "start": { "line": 12, "column": 3, "offset": 263 },
                    "end": { "line": 12, "column": 68, "offset": 328 }
                  }
                }
              ],
              "position": {
                "start": { "line": 12, "column": 3, "offset": 263 },
                "end": { "line": 12, "column": 68, "offset": 328 }
              }
            }
          ],
          "position": {
            "start": { "line": 12, "column": 1, "offset": 261 },
            "end": { "line": 12, "column": 68, "offset": 328 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)",
                  "position": {
                    "start": { "line": 13, "column": 3, "offset": 331 },
                    "end": { "line": 13, "column": 161, "offset": 489 }
                  }
                }
              ],
              "position": {
                "start": { "line": 13, "column": 3, "offset": 331 },
                "end": { "line": 13, "column": 161, "offset": 489 }
              }
            }
          ],
          "position": {
            "start": { "line": 13, "column": 1, "offset": 329 },
            "end": { "line": 13, "column": 161, "offset": 489 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "What are they being used for?",
                  "position": {
                    "start": { "line": 14, "column": 3, "offset": 492 },
                    "end": { "line": 14, "column": 32, "offset": 521 }
                  }
                }
              ],
              "position": {
                "start": { "line": 14, "column": 3, "offset": 492 },
                "end": { "line": 14, "column": 32, "offset": 521 }
              }
            },
            {
              "type": "list",
              "ordered": false,
              "start": null,
              "spread": false,
              "children": [
                {
                  "type": "listItem",
                  "spread": false,
                  "checked": null,
                  "children": [
                    {
                      "type": "paragraph",
                      "children": [
                        {
                          "type": "text",
                          "value": "This is a worngly written second-indended word",
                          "position": {
                            "start": { "line": 15, "column": 7, "offset": 528 },
                            "end": { "line": 15, "column": 53, "offset": 574 }
                          }
                        }
                      ],
                      "position": {
                        "start": { "line": 15, "column": 7, "offset": 528 },
                        "end": { "line": 15, "column": 53, "offset": 574 }
                      }
                    }
                  ],
                  "position": {
                    "start": { "line": 15, "column": 5, "offset": 526 },
                    "end": { "line": 15, "column": 53, "offset": 574 }
                  }
                }
              ],
              "position": {
                "start": { "line": 15, "column": 5, "offset": 526 },
                "end": { "line": 15, "column": 53, "offset": 574 }
              }
            }
          ],
          "position": {
            "start": { "line": 14, "column": 1, "offset": 490 },
            "end": { "line": 15, "column": 53, "offset": 574 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "Where do these methods come from?",
                  "position": {
                    "start": { "line": 16, "column": 3, "offset": 577 },
                    "end": { "line": 16, "column": 36, "offset": 610 }
                  }
                }
              ],
              "position": {
                "start": { "line": 16, "column": 3, "offset": 577 },
                "end": { "line": 16, "column": 36, "offset": 610 }
              }
            }
          ],
          "position": {
            "start": { "line": 16, "column": 1, "offset": 575 },
            "end": { "line": 16, "column": 36, "offset": 610 }
          }
        },
        {
          "type": "listItem",
          "spread": false,
          "checked": null,
          "children": [
            {
              "type": "paragraph",
              "children": [
                {
                  "type": "text",
                  "value": "Are there already ethical notes around? Or don’t they care?",
                  "position": {
                    "start": { "line": 17, "column": 3, "offset": 613 },
                    "end": { "line": 17, "column": 62, "offset": 672 }
                  }
                }
              ],
              "position": {
                "start": { "line": 17, "column": 3, "offset": 613 },
                "end": { "line": 17, "column": 62, "offset": 672 }
              }
            }
          ],
          "position": {
            "start": { "line": 17, "column": 1, "offset": 611 },
            "end": { "line": 17, "column": 62, "offset": 672 }
          }
        }
      ],
      "position": {
        "start": { "line": 12, "column": 1, "offset": 261 },
        "end": { "line": 17, "column": 62, "offset": 672 }
      }
    }
  ],
  "position": {
    "start": { "line": 1, "column": 1, "offset": 0 },
    "end": { "line": 18, "column": 1, "offset": 673 }
  }
}

Actual behavior

Incorrect behavior with Jekyll-style frontmatter

If you just run the above test document (with three dashes), it produces a wrong syntax tree. Observe how it does not detect the list properly.

{
  "type": "root",
  "children": [
    {
      "type": "yaml",
      "value": "title: \"The Devil is in the Details: Ethical Pitfalls in the Sociological use of NLP techniques\"\ndate: 2022-12-05\nid: 20221205151411\nauthor: Hendrik Erz",
      "position": {
        "start": { "line": 1, "column": 1, "offset": 0 },
        "end": { "line": 6, "column": 4, "offset": 160 }
      }
    },
    {
      "type": "heading",
      "depth": 1,
      "children": [
        {
          "type": "text",
          "value": "Export Link Removal",
          "position": {
            "start": { "line": 8, "column": 3, "offset": 164 },
            "end": { "line": 8, "column": 22, "offset": 183 }
          }
        }
      ],
      "position": {
        "start": { "line": 8, "column": 1, "offset": 162 },
        "end": { "line": 8, "column": 22, "offset": 183 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Export this file into any format to test out the corresponding LUA filter.",
          "position": {
            "start": { "line": 10, "column": 1, "offset": 185 },
            "end": { "line": 10, "column": 75, "offset": 259 }
          }
        }
      ],
      "position": {
        "start": { "line": 10, "column": 1, "offset": 185 },
        "end": { "line": 10, "column": 75, "offset": 259 }
      }
    },
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "* History of NLP in Sociology (from Mosteller and Wallace to today)\n* What methods are in use? (three types: Bayes/simple such as Logistic Regression; Machine Learning such as LDA/random forests; deep learning such as LSTM/BERT)\n* What are they being used for?\n* This is a worngly written second-indended word\n* Where do these methods come from?\n* Are there already ethical notes around? Or don’t they care?",
          "position": {
            "start": { "line": 12, "column": 1, "offset": 261 },
            "end": { "line": 17, "column": 62, "offset": 672 }
          }
        }
      ],
      "position": {
        "start": { "line": 12, "column": 1, "offset": 261 },
        "end": { "line": 17, "column": 62, "offset": 672 }
      }
    }
  ],
  "position": {
    "start": { "line": 1, "column": 1, "offset": 0 },
    "end": { "line": 18, "column": 1, "offset": 673 }
  }
}

Runtime

Node v16, Other (please specify in steps to reproduce)

Package manager

yarn v1

OS

macOS

Build and bundle tools

Webpack

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Jan 13, 2023
@wooorm
Copy link
Member

wooorm commented Jan 13, 2023

So first, for the code sandbox:

Frontmatter starts at the first line. You can test this on GH: adding blank lines before the YAML doesn’t work. Start with:

const sourceMarkdown = `---

…and it works.

@wooorm
Copy link
Member

wooorm commented Jan 13, 2023

It breaks when that said YAML frontmatter uses Jekyll-style delimiters (three dashes both on top and on the bottom of the frontmatter)

What doesn’t work? I believe it works. That’s the default. That’s what 1.6m people per month use the plugin for. And it’s well tested

@wooorm
Copy link
Member

wooorm commented Jan 13, 2023

Here’s a reduced test case:

import {fromMarkdown} from 'mdast-util-from-markdown'
import {frontmatter} from 'micromark-extension-frontmatter'
import {frontmatterFromMarkdown} from 'mdast-util-frontmatter'

const sourceMarkdown = `---
title: a

* b
`;

const tree = fromMarkdown(sourceMarkdown, {
  extensions: [frontmatter()],
  mdastExtensions: [frontmatterFromMarkdown()]
})

console.log(tree) // Contains a paragraph with `* b` as text, not a list.

The reason is that this looks a lot like frontmatter. So it continues on checking all those lines for the ending. When doing that, it prevents the lists from being checked. Afterwards, it can’t do those lists again.

@wooorm wooorm added 🐛 type/bug This is a problem 👍 phase/yes Post is accepted and can be worked on labels Jan 13, 2023
@github-actions github-actions bot removed the 🤞 phase/open Post is being triaged manually label Jan 13, 2023
@github-actions
Copy link

Hi! This was marked as ready to be worked on! Note that while this is ready to be worked on, nothing is said about priority: it may take a while for this to be solved.

Is this something you can and want to work on?

Team: please use the area/* (to describe the scope of the change), platform/* (if this is related to a specific one), and semver/* and type/* labels to annotate this. If this is first-timers friendly, add good first issue and if this could use help, add help wanted.

@nathanlesage
Copy link
Author

Thank you for your swift reply! You are right that I accidentally added a newline there. However, when I remove that, the list is still not properly being detected. Here's a screenshot from the updated sandbox:

image

What doesn’t work? I believe it works. That’s the default. That’s what 1.6m people per month use the plugin for. And it’s well tested

I am sorry if I came across as ungrateful or anything — I am a big fan of your work and appreciate everything you do. I did some extensive checking this morning to find the bug, and ended up finding this oddity. I am, however, not sure what kind of interaction effect is (I believe that each plugin itself works fine, absolutely, but since we have this weird bug, I suspect some interaction – that's also why I'm not sure if this emanates from here, or from the remark plugin that wraps this one…? I am thankful for any guidance in this regard.)

Regarding your reduced case, what is the intention in leaving out the ending fence of the frontmatter here? In other words, why did you not write the following?

---
title: a
---

* b

One additional thing I noted, but which I believe can be fixed when this bug is sorted out, is that if the list is not being properly detected, the parser "eats" the indentation spaces in front of the second-order list item. You may note that there is this worngly spelled word in the indentation. The reason is that I'm using the syntax tree to run a spell checker, and I noted that the spaces in front of the list just get swallowed, shifting the offsets of the text nodes into a wrong position by that amount. I did not initially report that because I believe it is just a fallout effect of whatever this is.

I greatly appreciate any guidance in this regard. Thank you!

@wooorm
Copy link
Member

wooorm commented Jan 13, 2023

am sorry if I came across as ungrateful or anything

No need, no worries, you didn’t come across as such. See my later response for a reduced case. You are correct, there is a bug.

Regarding your reduced case, what is the intention in leaving out the ending fence of the frontmatter here? In other words, why did you not write the following?

Because that’s exactly what’s going on here: if an opening but no closing fence is found, lists don’t work. You had --- and ... in your first case. Which, when starting with --- but without an ending of ..., shows that exact bug. But your second case (start and end ---) masked what was happening. Because some frontmatter was parsed.


For the second bug, whitespace being eaten, I think it’s unrelated to this, probably. It’s because whitespace in paragraphs is removed. E.g.:

asd
                                   whatever

That positional info might be lost somewhere though. As the text node is asd\nwhatever for that. But also this markdown would result in the exact same text node:

> * asd
> whatever

Maybe some positional info is lost somewhere

@nathanlesage
Copy link
Author

Ahhhh, I see what you mean, yes! Thank you for the clarification!

@nathanlesage
Copy link
Author

Hey @wooorm, just as an update: I have now figured out that the fault in the options also prevents the Gfm parser (remark-gfm) from detecting footnotes (possibly other elements as well). It works fine if I just comment out the corresponding config option { type: 'yaml', fence: { open: '---', close: '...' } }.

@wooorm
Copy link
Member

wooorm commented Feb 8, 2023

Yep, block quotes, lists, and footnote definitions! All “containers” that fail on this :(
Will be maybe a bit less than 2 months before I can get to it unfortunately.

@nathanlesage
Copy link
Author

Hey, thanks for your quick reply! I can absolutely understand if you can't spend time on it right now. I would be happy to prepare a PR, since I think I found the cause, but it would be ideal if you could point me to a few resources so that I can make the fix work properly and stick to the unified conventions.

Here's what I found:

Because both frontmatters (The Pandoc style – --- / ... – and the jekyll-style – --- / ---) start with the same character code (45), the tokenizer layout looks like this:

{
    flow: {
      '45': [
        tokenizer, // Parses --- / ...
        tokenizer // Parses --- / ---
      ]
    }
}

This makes the engine go through the document twice, once for the --- / ... frontmatter and then for the --- / ---frontmatter type. The first cannot find the ... ending fence and hence parses the full document as a frontmatter value, thereby disturbing the other parsers.

There are two possible solutions:

If the engine allows one to abort parsing anything if the given frontmatter type is not detected, then we could add a break condition so that the parser that doesn't fit does not attempt to detect anything.

The other solution would be to merge the openings and closings, i.e., allow for an array of strings in the open/close options that configure the plugin. I've seen that the buffer variable is being used to detect a frontmatter fence, I could rewrite the code so that it accepts an array and just tests for every possibility. Although, since the plugins themselves actually work perfectly fine, if I just add a single frontmatter definition, it might be much cleaner to add the aforementioned "break condition".

What are your thoughts? I can get a PR ready for review today so that it's fixed!

@wooorm
Copy link
Member

wooorm commented Feb 8, 2023

See my small reproduction above: #5 (comment). This issue exists for normal frontmatter. A single format. ---\n* a.

I think it’s going to get super complex: the engine does not really support aborting. Because normal markdown doesn’t have that: once there’s a line that opens fenced code, anything after it is fenced code, for example.
And you can’t have a list item in those.
Same here. The YAML is open. You can’t have a list item in it.
But it never closed. And we don’t have list items.
So micromark needs an architectural overhaul to solve.

@nathanlesage
Copy link
Author

Ouh … alright. This explains why I wasn't getting anywhere with my attempts at fixing in this regard … :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
👍 phase/yes Post is accepted and can be worked on 🐛 type/bug This is a problem
Development

No branches or pull requests

2 participants