Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference link autocomplete #893

Closed
gavbarnett opened this issue Feb 3, 2021 · 25 comments · Fixed by #898
Closed

Reference link autocomplete #893

gavbarnett opened this issue Feb 3, 2021 · 25 comments · Fixed by #898
Assignees
Labels
Area: Auto completion Code completion and suggestion, aka IntelliSense. Help wanted Looking for help. Issue: Enhancement Improvements in existing features. Res: Fixed Fix is checked in, but it might be a few weeks until a release.
Milestone

Comments

@gavbarnett
Copy link
Contributor

Proposal

When adding a new link the option to autocomplete from reference links should be implemented.

Example:

This is a [ReadMe reference link text].

[ReadMe reference link text]:(https://github.com/yzhang-gh/vscode-markdown/blob/master/README.md)

If the bottom reference line exists anywhere in the file then starting a new link text should provide option to autocomplete from the reference link text.

This is a [_this should autocomplete from all references in file if they exist_].

[ReadMe reference link text]:(https://github.com/yzhang-gh/vscode-markdown/blob/master/README.md)

This would seem similar but not the same as what is currently implemented

This is a [ReadMe][ReadMe reference link text]. the second [] is autocompleted but not the first

[ReadMe reference link text]:(https://github.com/yzhang-gh/vscode-markdown/blob/master/README.md)
@gavbarnett
Copy link
Contributor Author

gavbarnett commented Feb 3, 2021

I think this is a fairly simple regex addition in completion.ts line 496:

else if (/\[[^\]]*?\]\[[^\]]*$/.test(lineTextBefore)) {
            /* ┌───────────────────────┐
               │ Reference link labels │
               └───────────────────────┘ */

The above regex checks for essentially "[ ] [ ]"
Proposed change is as below to check for "[ ] [ ]" or "[ ]"

else if (/\[[^\]]*?\]\[[^\]]*$/.test(lineTextBefore) || /\[[^\]]*?$/.test(lineTextBefore)) {
            /* ┌───────────────────────┐
               │ Reference link labels │
               └───────────────────────┘ */

I suppose that's a bit redundant and simply checking for "[ ]" would suffice, and makes more sense:

else if (/\[[^\]]*?$/.test(lineTextBefore)) {
            /* ┌───────────────────────┐
               │ Reference link labels │
               └───────────────────────┘ */

I've never done stuff with VSCode extensions or commit to a lovely repo such as this before, but happy to have a go at this.

References

@gavbarnett
Copy link
Contributor Author

I actually think this is a bug not a feature request now.

It should check for only one [ ] not two [ ] [ ] when autocompleting reference links. Two is valid see spec example 523 , it's like a linkname to a linkref to a link, but it would work fine if the regex was only one set of square brackets.

@yzhang-gh
Copy link
Owner

I suppose that's a bit redundant and simply checking for "[ ]" would suffice, and makes more sense:

Looks you are right as a single [] can be "a shortcut reference link".

happy to have a go at this.

😉

@yzhang-gh yzhang-gh added Area: Auto completion Code completion and suggestion, aka IntelliSense. Help wanted Looking for help. labels Feb 4, 2021
@Lemmingh
Copy link
Collaborator

Lemmingh commented Feb 4, 2021

This is, I guess, related to #260, etc.

As you can see, around our codebase, there are so many regular expressions, which are based on presumptions and can only work on the most straightforward forms with local knowledge.

When trying to support non-trivial cases, strange scenarios come up very fast.
The original Markdown is designed to be like natural language. CommonMark is created with impenetrable compatibility in first place. Its derivatives also follow the philosophy. Undoubtedly, the language has become a set of exceptions and exceptions and exceptions ... Correctly understanding a Markdown document is even more complex than handling HTML, making our regexp-based operations fatally flawed. You can easily fool them, including the one you mentioned in this thread.

} else if (/\[[^\]]*?\]\[[^\]]*$/.test(lineTextBefore)) {

"Recognizing reference definitions" and "Inferring the user is typing a reference link" are both challenging:

  • The Spec requires case-insensitive label comparison. Well, it's not a big problem in terms of rendering. AFAIK, ToUpperCase solves it, and markdown-it indeed makes it in this way. However, when it comes to editing, obviously, we will have to maintain the original labels and normalized labels, since for some scripts, case conversion is not round-tripped.

  • The Spec permits "soft line break" and space to be freely interspersed in many structures, including links.

  • There are three kinds of reference links, together with lots of quirks. As far as I can see from other threads, preferences differ greatly.

To support more forms, we need to settle many tricky cases first, and figure out a strategy to balance user experience.


I'm now wondering if it's worth creating our own Markdown parser, since it is really hard to find an existing solution (in JavaScript) that meets the needs of an editor.

I tried markdown-it, but actually almost give up now. Its architecture blocks fine source mapping, thus, it is suitable for very limited scenarios.
#877 (comment)
#890
markdown-it/markdown-it#374 (comment)

I would not say the architecture of markdown-it is a failure. But I'd argue that breaking a CommonMark document into token stream is neither simpler nor more convenient than directly constructing syntax tree.


Good luck.

@Lemmingh Lemmingh added the Issue: Enhancement Improvements in existing features. label Feb 4, 2021
@gavbarnett
Copy link
Contributor Author

For my own understanding are you targeting Commonmark first/only or are you wanting compatibility with other popular variants like github markdown?

Sounds like thats a separate wider issue you'd like to solve re regex? Happy to start spending some time thinking about this but does sound complex. Do you want to raise an issue separately for tracking/discussion on this?

@Lemmingh
Copy link
Collaborator

Lemmingh commented Feb 5, 2021

Markdown All in One focuses on CommonMark and GitHub Flavored Markdown (GFM), plus a few extensions urgently needed by the community.

https://markdown-all-in-one.github.io/docs/decisions/markdown-syntax-and-flavors.html


Sounds like thats a separate wider issue you'd like to solve

Issues have characteristics in common. Although this product acts like an initiator with a bunch of islands, I still try to consider things in an overall view first.


Sorry for not being clear enough.

You can see two main points from those threads (typically, #890, #877, #809):

  • Under the current model/architecture, we are facing lots of difficulties in providing accurate/robust results in current features, and implementing new features. In my memory, most bugs come from TOC-related operations, and there was a time when they were seemingly easy to patch. However, the situation has been going worse since version 3.0.0. As for new features, because functions can now honestly only gain local knowledge, even a small step leads to unbelievable complexity.

  • We are looking for a more normal way, that's parsing the whole document and taking advantage of symbol information, but existing solutions are disappointing. We gave markdown-it a try, but it's so rendering-centered from my perspective that I only succeeded in applying it to locating lists and generating slugs for GitHub. Maybe it's not its fault but of some defects of CommonMark specs. When CommonMark was named STMD (Standard Markdown), its wording and reference implementation were not satisfactory, upsetting a few people; even today, some confusing expressions remain, and people argue day by day on talk.commonmark.org.

Back to the reference link completion, which was introduced in early 2019, and have not been maintained for years, it also suffers from flawed presumptions. Additionally, when and how to show completion is also implicitly in question. Maybe your idea (show as soon as hitting [) is right.


There is a piece of history.

From the beginning, we match raw content against regular expressions to analyze documents. You can still find some regexp dating back to the first release. I remember somewhere, Yu Zhang told me that we'd better support the most easy-to-use and common cases first, and ignore those corner cases until someone reports them.

That was fine before version 3.0.0, when nominally only GFM was supported, and too complex things could be confidently delayed. Besides, around version 1, this product was actually more like an attachment of VS Code's vscode.markdown-language-features.

However, with #660 introducing slugifyMode, and #658 enabling loading VS Code markdownItPlugins extensions, it began to explicitly suit a much wider range of users. A good number of surprising problems arose soon. The tip of the iceberg:

Naturally, we turned to try markdown-it. I did a few experiments, but it appears hardly capable of powering editing:

  • It has a concept called core rule at parsing stage, so that things like markdown-it-emoji can distort the input without leaving any trace.
  • Undoubtedly, source mapping for inline structures is impossible, and corresponding Tokens can only represent information that a render needs and receives.
  • Moreover, it does not even provide precise source mapping for block structures under some circumstances. We have to check blank lines twice. We have to search for the original position of a reference definition. ...
  • The content property of Token is not consistent. Sometimes it's raw, sometimes it's converted, and the behavior does not strongly correlate with the token type.

Finally, I guess the way out is to build a standalone system from scratch. It's evidently complex itself, not to mention that we would probably need a shim layer to maintain compatibility with markdown-it. 😂


Do you want to raise an issue separately for tracking/discussion on this?

I feel there is already a thread. Let's check the issue tracker.

@gavbarnett
Copy link
Contributor Author

That's a lot to go through, so i will digest it further this evening.

I'm not fully into the code yet so take my suggestions here as a n00b possibly talking rubbish.
Looking at the completion code i don't think the regex itself is an issue, but there is significant repetition that could be split out. Doing that would make testing easier I feel.
Similarly the large if else structure should probably be broken down into function calls IMO.

I've fixed this bug in my local repo (need to push to github) but it sounds like I'll need to spend some time testing various edge case's.

@yzhang-gh
Copy link
Owner

yzhang-gh commented Feb 6, 2021

Thank you both for the discussion ❤.

I agree (as @Lemmingh said) that many regexp-based approaches are now not capable to deal with more and more Markdown "exceptions". And we perhaps need to turn to a syntax parser in the future, although it is a long way to go.

For now, I think it is acceptable to include the simple "patch" from @gavbarnett as it does help some users (Good) and in the same time it doesn't introduce any difficulties to our future refactoring (Not bad).

(The only concern is that people will see some completions when they type the first [. It may be a bit confusing for those who don't know about "shortcut reference links".)

@gavbarnett
Copy link
Contributor Author

Will try to get a pull request in soon.

The other odd side effect is when creating check box list's - [ ] it offers a reference link completion. I was going to escape this but it is acceptable to have a list of links so it needs to stay.

@Lemmingh
Copy link
Collaborator

Lemmingh commented Feb 6, 2021

The other odd side effect is when creating check box list

As long as the false positive is not very annoying, I feel it'll be OK.

After all, I don't expect "match raw content against regular expressions" to give accurate result.

@yzhang-gh
You idea?


It may be a bit confusing for those who don't know about "shortcut reference links".

We need statistics, which is not possible now. By counting, we can learn a user's preference, and only show completion for the syntax they like.

For now, my suggestion is similar to #893 (comment):
Check full and collapsed link with a loose test first, then check shortcut link with a stricter test. Given that shortcut reference link is indeed an alien, we'd better only recognize it when

The bracket pair is surrounded by spaces.
↓   ↓
 [ ]
  ^
  Cursor is between a bracket pair.

This might mitigate some false positive within paragraphs.


I see a lot of \s. However, the sad fact is the CommonMark whitespace is different from the ECMAScript white space. So, you need [ \t\r\n\f\v].

@Lemmingh
Copy link
Collaborator

Lemmingh commented Feb 6, 2021

Another thought:

We can do some refactoring, then show completion for shortcut reference link only on explicit request (triggerKind === Invoke), or at least !== TriggerCharacter.

@yzhang-gh
Copy link
Owner

As long as the false positive is not very annoying, I feel it'll be OK.

After all, I don't expect "match raw content against regular expressions" to give accurate result.

Agree

Check full and collapsed link with a loose test first, then check shortcut link with a stricter test.

Sounds good, although I guess it won't help a lot? (When we see the first [], we don't know whether it will be a [][] or []().)

Another thought:

We can do some refactoring, then show completion for shortcut reference link only on explicit request (triggerKind === Invoke), or at least !== TriggerCharacter.

It is an option. The thing is I don't think this feature is worth the effort 🤣.

@Lemmingh
Copy link
Collaborator

Lemmingh commented Feb 6, 2021

I guess it won't help a lot?

Right, the difference between collapsed and shortcut is just a trailing []. So, checking twice should make no sense.

And you remind me of "Beyond Markdown", where jgm listed reference link as one of the six pain points. 😂

(Edited by @yzhang-gh: the link to Beyond Markdown)

@gavbarnett
Copy link
Contributor Author

gavbarnett commented Feb 6, 2021

I see a lot of \s. However, the sad fact is the CommonMark whitespace is different from the ECMAScript white space. So, you need [ \t\r\n\f\v].

This is exactly where I think breaking down the regex into chunks would be handy for re-use.

Naive example code (This would need some thought to iron out the kinks, such as "optional whitespace (including up to one line ending)"):

/* regex sub-patterns */
//these could be either strings or actual regex.
reg = {
  "whitespace" = "[ \t\r\n\f\v]",
  "url" = "[TBD]",
  "linkLabel" = "[TBD]",
  //etc.
}

//build a pattern for checking
"regLinkRefDefinition" = regex_concat(
      reg.whitespace, "{0-3}", 
      reg.linkLabel, ":", 
      reg.whitespace, "{0-1}", 
      reg.url, 
      "(", reg.whitespace, reg.title, ")?"
    ) //this is probably invalid regex syntax, but just showing the idea. It makes the intent more explicit at least.
  /*A link reference definition consists of a link label, indented up to three spaces, 
  followed by a colon (:), optional whitespace (including up to one line ending), 
  a link destination, optional whitespace (including up to one line ending), 
  and an optional link title, which if it is present must be separated from the link destination by whitespace. 
  No further non-whitespace characters may occur on the line.*/

Regex doesn't seem to natively support concatenation but there are plenty of example on stack exchange for how to do this. It basically sorts out some escape characters joins the strings and converts to a new regex.


For now, my suggestion is similar to #893 (comment):
Check full and collapsed link with a loose test first, then check shortcut link with a stricter test. Given that shortcut reference link is indeed an alien, we'd better only recognize it when

Not sure I follow? Why would it require a space before hand? I think if its valid CM then the autocompletion should be offered and the user can just type something else, don't think it really get's in the way.


In my view there is a difference between editing (where some guessing is required as mostly you need to be fast and work line by line) and syntax validation/html rendering (where you can read the full file and construct a tree as you say). While editing you're never sure which branch of the tree the user is about to take so you have to offer all possible options (ranked in some way).

@yzhang-gh
Copy link
Owner

breaking down the regex into chunks

Just to add that we did the similar thing for GFM table regexp

private detectTables(text: string) {
const lineBreak = String.raw`\r?\n`;
const contentLine = String.raw`\|?.*\|.*\|?`;
const leftSideHyphenComponent = String.raw`(?:\|? *:?-+:? *\|)`;
const middleHyphenComponent = String.raw`(?: *:?-+:? *\|)*`;
const rightSideHyphenComponent = String.raw`(?: *:?-+:? *\|?)`;
const multiColumnHyphenLine = leftSideHyphenComponent + middleHyphenComponent + rightSideHyphenComponent;
//// GitHub issue #431
const singleColumnHyphenLine = String.raw`(?:\| *:?-+:? *\|)`;
const hyphenLine = String.raw`[ \t]*(?:${multiColumnHyphenLine}|${singleColumnHyphenLine})[ \t]*`;
const tableRegex = new RegExp(contentLine + lineBreak + hyphenLine + '(?:' + lineBreak + contentLine + ')*', 'g');
return text.match(tableRegex);

@gavbarnett
Copy link
Contributor Author

gavbarnett commented Feb 7, 2021

That's good,
I notice there is a MarkdownSpec.ts which looks nice as a place to put common expressions? But I assume you have plans for that file.

I've made a start at listing the regex from the spec as best I can, it's a long slog this method (my regex skills are slowly improving). I've been breaking it up per section in the spec for now and making each expression bounded in brackets so they can be combined when needed. (This is not directly related to this issue so I might keep it separate when I do a pull request). I'm especially happy with cmAtxHeadings for now 😄.

Clearly this doesn't solve all problems (just got to Setext headings 😱, surely only crazy people that want to watch the world burn would use them!) but it could be a nice foundation to work from.

/*
        CommonMark 0.29 regexList
        use with "u" flag, i.e. /pattern/gu 
        */
        const cmPre = {
            //uses Unicodes and unicode types to be as close to spec as possible
            //Characters and lines
            "lineEnding": String.raw`(\u000A|\u000D|\u000D\u000A)`, 
            "whitespace": String.raw`[\u0020|\u0009\|\u000A|\u000B|\u000C|\u000D]`,
            "unicodeWhitespace": String.raw`[\p{Z}|\u0009|\u000D|\u000A|\u000C]`,
            "space": String.raw`[\u0020]`,
            "spaceOrTab": String.raw`[\u0020|\u0009]`,
            "asciiPunctChar": String.raw`[\u0021-\u002F|\u003A-\u0040|\u005B-\u0060|\u007B-\u007E]`, 
            "punctChar": String.raw`[\u0021-\u002F|\u003A-\u0040|\u005B-\u0060|\u007B-\u007E|\p{Pc}|\p{Pd}|\p{Pe}|\p{Pf}|\p{Pi}|\p{Po}|\p{Ps}]`,
            //Tabs
            "tab": String.raw`[\u0020\u0020\u0020\u0020|\u0009]`,
            //Insecure characters
            "insecureChar": String.raw`[\u0000]`
        }
        const cmBlocks = {
            "thematicBreak": String.raw `(^${cmPre.space}{0,3}((\*${cmPre.spaceOrTab}?){3,}|(-${cmPre.spaceOrTab}?){3,}|(_${cmPre.spaceOrTab}?){3,})${cmPre.spaceOrTab}*)`
        }
        const cmAtxHeadings = {
            "hClosing": String.raw `((${cmPre.spaceOrTab}+[#]+${cmPre.spaceOrTab}*)?$)`,
            "hContent": String.raw `((.*?)?)`, //suspect this could be better
            "h": String.raw`(^${cmPre.space}{0,3}[#]{1,6}${cmPre.spaceOrTab}+)`,
            "h1": String.raw`(^${cmPre.space}{0,3}[#]{1}${cmPre.spaceOrTab}+)`,
            "h2": String.raw`(^${cmPre.space}{0,3}[#]{2}${cmPre.spaceOrTab}+)`,
            "h3": String.raw`(^${cmPre.space}{0,3}[#]{3}${cmPre.spaceOrTab}+)`,
            "h4": String.raw`(^${cmPre.space}{0,3}[#]{4}${cmPre.spaceOrTab}+)`,
            "h5": String.raw`(^${cmPre.space}{0,3}[#]{5}${cmPre.spaceOrTab}+)`,
            "h6": String.raw`(^${cmPre.space}{0,3}[#]{6}${cmPre.spaceOrTab}+)`
        }
        // Example isolating hContent (the title string) from any header removing spaces & tabs:
        // pattern = String.raw`(?<=` + h + String.raw`)` + 
        //           hContent +
        //           String.raw`(?=` + hClosing + String.raw`)`

here's what that final example pattern should look like:
(?<=(^[\u0020]{0,3}[#]{1,6}[\u0020])+)((.*?)?)(?=(([\u0020]+[#]+[\u0020]*)?$)), I've been testing my regex's here: https://regex101.com/

@yzhang-gh
Copy link
Owner

This is not directly related to this issue so I might keep it separate when I do a pull request

That will be good as we plan to release a new version very soon.


As for the regex expressions, I think we can have a standalone utility file for it (as it doesn't look to be a part of the Markdown spec).

@Lemmingh
Copy link
Collaborator

Lemmingh commented Feb 7, 2021

breaking down the regex into chunks would be handy for re-use.

these could be either strings or actual regex.

  • I like the idea storing regexp chunks. Other parsers also keep frequently used regexp as constants.
  • I prefer using regexp to match delimiters instead of a whole structure.
  • I prefer to store regexp literal rather than concatenate strings.

Markdown is designed to be natural language with formatting marks, like those ancient manuscripts. Its mysterious rules are designed to mimic how a human reads a piece of writing.

Parsing Markdown is more like analyzing natural language. Although I'm not familiar with natural language processing, I can see Markdown's parsing strategy (block-inline) is hierarchical and recursive from the beginning. It is different from the lexical-semantic workflow that we use to parse programming languages. When working with normal programming languages, we divide source code into tokens in the first phase, and can predict what should be next as soon as we start. While with Markdown, we had better take node as basic unit, and it's still difficult to infer what could be there even after analyzing the whole document.

Matching a whole structure is exactly matching raw content, which has proved to be full of limitations, and is why the current implementation is so error-prone. If we're talking about patching existing code, I don't mind. However, #893 (comment) and #893 (comment) appear to be of compiler design. Then, I am against it.

Writing regexp literals is clearer and easier to reveal errors, although it limits many advanced possibilities. Concatenating strings to build regexp is honestly useful in many scenarios. However, it requires extreme carefulness due to an unsolvable hazard, which is also the biggest reason that TC39 rejected the Escaping Proposal. I do not think it's worth reviewing a short piece of code again and again just to ensure its runtime behavior is correct.

Personally, I wish there could be a Concatenation Proposal for manipulating RegExp internal tree, but guess TC39 will reject it for complexity.


Not sure I follow? Why would it require a space before hand?

Never mind, I was daydreaming.

I thought since shortcut reference link is dangerous, we can make it hard to appear. However, with only a [], the user's intention is unclear. It can be any kind of reference link, inline link, literal, and so on.


editing (where some guessing is required as mostly you need to be fast and work line by line) and syntax validation/html rendering (where you can read the full file and construct a tree as you say)

Embarrassingly, Markdown is strange as said above. CommonMark forces you to scan the entire document at least twice to know its content. Indeed, some of our current functions implicitly scan the document four or five times in a run, which IMO, is unnecessary cost.

Even after that, you cannot be confident enough about what is going on, because it's Markdown. Thus, you are right, how to guess is an art.

CommonMark regards everything as valid. A sequence that does not fit into known syntax is textual content. This is where most difficulties lie in.


While editing you're never sure which branch of the tree the user is about to take so you have to offer all possible options (ranked in some way).

Agree.

@gavbarnett
Copy link
Contributor Author

CommonMark regards everything as valid. A sequence that does not fit into known syntax is textual content. This is where most difficulties lie in.

Agree. Wondering if the extension should help inforced preferred style guide then? Doesn't really help with completion though.

I'd not looked into TC39 escape proposal before, makes for interesting reading. But we could implement an insecure function for this/for concatenation if we're carefully about the regex chunks. Agree using regex instead of strings is better.

This is all an interesting discussion but I'm aware we've gone down a few rabbit holes from the initial [ ][ ] vs [ ] issue. I'll try to make some effort to create separate enhancement issues this evening so its easier to track.

@Lemmingh
Copy link
Collaborator

Lemmingh commented Feb 7, 2021

help enforce preferred style guide

markdownlint is a dedicated style checker. Perhaps we don't need to repeat it.


I'll try to make some effort to create separate enhancement issues this evening so its easier to track.

Do you mean parser design? We can discuss and manage them at https://github.com/markdown-all-in-one/markdown-language-service.

@gavbarnett
Copy link
Contributor Author

#895 created to discuss regex chunking
#896 created to discuss splitting the large provideCompletionItems function.

@gavbarnett
Copy link
Contributor Author

gavbarnett commented Feb 7, 2021

help enforce preferred style guide

markdownlint is a dedicated style checker. Perhaps we don't need to repeat it.

I'll try to make some effort to create separate enhancement issues this evening so its easier to track.

Do you mean parser design? We can discuss and manage them at https://github.com/markdown-all-in-one/markdown-language-service.

Sorry, missed that!, just shows how long this browser tab's been open!

Agree - don't reinvent the wheel

Ok - will hop over there. I think the above 2 issues can remain here for now as they have merit anyway and are relatively straightforward. But a full parser design maybe the better longer term

@gavbarnett
Copy link
Contributor Author

Is there a set of tests for auto completion in the code already? If so how do i use it?

If not I'll just test manually and list my tests.

@yzhang-gh
Copy link
Owner

No, so we need to test it by ourselves.

@gavbarnett
Copy link
Contributor Author

Sorry this is taking a while, only got a few hours in the evenings for it.

I'm currently trying to tighten up the regex to more accurately match in a few places. Getting there! I think I might have to change some bits to a regex with multiple (& global) flags rather than line by line as link ref definitions can be multiline.

Currently able to match all but 2 examples of link reference definition in the common mark spec. One i think i can solve (link ref definition within code blocks should be ignored), the other (link ref definition with whitespace line break in the title text should be ignored) i may leave as a problem for the next person.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Auto completion Code completion and suggestion, aka IntelliSense. Help wanted Looking for help. Issue: Enhancement Improvements in existing features. Res: Fixed Fix is checked in, but it might be a few weeks until a release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants