
Provide source information to caller during parsing #91

Open
niblo opened this issue Sep 7, 2019 · 31 comments

Comments

@niblo
Contributor

niblo commented Sep 7, 2019

As mentioned previously by @mity:

[...] There were some other feature requests in the past that would allow using MD4C for e.g. syntax highlighting of the Markdown format in a text editor. So far it has not been implemented, but there is a dummy never-called callback (MD_PARSER::syntax()). (It does not even have a clear function prototype yet.)

My preliminary idea was that when (in the future) an app sets the callback non-NULL, the callback would be called during the parsing in order to inform the app about things like "Here at offset 1234, an inline link starts. Here at offset 2000, an inline URL starts. Here at offset 2010 it ends, etc."; or maybe it could rather work in terms of ranges rather than begin/end events.

The primary motivation was to allow the app to do syntax highlighting of a Markdown source, [...]

I will make an additional suggestion that the current line number is useful information too. (It just crossed my mind that it may well be enough to have the offsets and no line numbers. Text editors usually have a method to place the cursor at some specified offset from 0.)

For another useful application of this, see http://moinmo.in/WikiSandBox. Double click somewhere in the text, and you will be taken to the editor with the cursor placed at the start of the corresponding line. If you look at the HTML source of the rendered text, you will see that it has hidden line references.

@mity
Owner

mity commented Sep 8, 2019

For another useful application of this, see http://moinmo.in/WikiSandBox.

Yes, it could have quite a lot of applications. But so far no one has been motivated enough to really work on it...

I have some idea of how the feature might be implemented, but I never actually started working on it because I think it would be better to do so only alongside some real application development, so that the API gets some feedback/verification on whether the ideas are solid.

A prototype of the callback could look like this:

int syntax(MD_SIZE off, MD_SIZE size, unsigned what, void* detail, void* userdata);

Parameters:

  • off and size would determine the range of offsets in the source text that the callback info applies to,
  • what would likely be the most important one (see below),
  • detail would be reserved for the future (always NULL until we have some use for it),
  • userdata would be the user context propagated from md_parse(), as the other callbacks already do.

The int return value would allow the app to abort the parsing, as with the other callbacks.

The what would likely be a bit mask, where:

  • a bit would encode whether it is about a "block",
  • a bit would encode whether it is about an "inline",
  • a bit would encode whether it is about the "text" flow (e.g. an escape sequence),
  • a bit would encode whether it is "value" contents (e.g. a link URL) or a special character used to denote the syntax (e.g. # for headings or []() for links),
  • some (reasonable) bitset to specify the corresponding MD_SPANTYPE (if inline) or MD_BLOCKTYPE (if block) or MD_TEXTTYPE (if text) used in the other callbacks,
  • some (reasonable) bitset to distinguish parts of the syntax construction (e.g. "a link destination" versus "a link title") in the case of the value contents.

The public header would provide some list of preprocessor macros for all the potentially fired what-events.
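To make this concrete, a rough sketch of what such macros and a call could look like (all of these names are purely hypothetical, nothing of this exists in md4c.h yet):

/* Hypothetical flag names for the 'what' bit mask -- none of these exist yet. */
#define MD_SYNTAX_BLOCK     0x0001  /* range belongs to a block construction */
#define MD_SYNTAX_INLINE    0x0002  /* range belongs to an inline (span) construction */
#define MD_SYNTAX_TEXT      0x0004  /* range belongs to the text flow (e.g. an escape) */
#define MD_SYNTAX_MARKER    0x0008  /* range covers syntax chars such as '#' or '[' */
#define MD_SYNTAX_VALUE     0x0010  /* range covers "value" contents such as a link URL */

/* E.g. for the '#' of an ATX heading the parser could then fire: */
/* parser->syntax(off, 1, MD_SYNTAX_BLOCK | MD_SYNTAX_MARKER, NULL, userdata); */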

Some more notes come to my mind:

  1. Imho the callback likely wouldn't guarantee any order of the calls (e.g. as sorted by offset).

    Rationale: MD4C uses two big passes (blocks versus inlines) and that cannot be easily changed (the parser must know all reference definitions in the document before it can resolve links). Similarly the inlines are resolved in separate loops over the collected potential marks (the priority rules).

  2. There might be multiple calls for a single Markdown syntax entity. For example for input like this:

    ... [foo](http://example.com "link title") ...
    

    we could generate many calls:

    • the complete inline link (covering the whole range of the link),
    • many calls for the brackets and quotes denoting parts of the link so the editor can highlight those characters,
    • calls for the textual attributes (link URL range, link title range),
    • (and of course there would additionally be some calls for stuff inside the link where appropriate, e.g. an emphasis in the link text or an escape sequence in the title.)
  3. For performance reasons, the callback likely would not be called at all for most of the source contents (i.e. the "normal" text).

  4. Once we decide the what bit encoding, it could then mostly all be implemented incrementally across many releases as applications need it. Until it gets good feature parity with the other callbacks, or until we get good confidence it is the right approach, we could mark it as an experimental part of the API and relax compatibility guarantees during that period.

@niblo
Contributor Author

niblo commented Dec 8, 2019

I re-read this and it occurred to me that the syntax callback does not make it possible to correlate offsets in the text with elements in the rendered text. Is that correct @mity?

@niblo
Contributor Author

niblo commented Dec 8, 2019

(Not saying it's not useful - because I really think it is - but I think I made an error in thinking about that in particular.)

@mity
Owner

mity commented Dec 8, 2019

I re-read this and it occurred to me that the syntax callback does not make it possible to correlate offsets in the text with elements in the rendered text. Is that correct @mity?

Correct.

And I can't see how that could be (easily) done if you need it. The parser has no knowledge of how much text the renderer will output, given the output format's syntax decoration (e.g. HTML tags) or escaping rules.

All that can theoretically be achieved is that it would be called at the "right time", so that when it is called, the renderer knows "the current position in the output corresponds to this and that". That would mean we would not call the callback when we detect something, but only at the time when all the rendering callbacks are called.

The drawback of that approach would be larger memory consumption, as we would have to remember more data (at least the offsets in the input) for all the stuff.

Another consequence would be that we could not emit ranges, only beginning and end events, because "the right time" differs for each. That would complicate "simple" applications that may just need the corresponding beginning and end of some stuff in the input, e.g. to colorize the Markdown syntax.

And finally, there would be unsolvable limits to that approach: e.g. link reference definitions have no direct counterpart in the rendered text at all. They only affect the reference links, distributed arbitrarily in the document, whose labels match.

@mity
Owner

mity commented Dec 8, 2019

Just got an idea: maybe a guarantee that the events come in some defined order (e.g. ordered by the beginning of the range) within the set of events of that particular type would suffice.

E.g. that the order of paragraphs as provided by syntax() would be the same as the order in which enter_block() is called.

This would allow the application to build a mapping between offsets in the input and in the output if it needs them, as the renderer can simply count how many paragraphs it has already seen and use that counter as an ID of all the paragraphs.

Ditto for all the other block/inline elements.

I know it's kind of a "leave the hard work to the caller" way, but maybe it is the right one here.
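For illustration, a minimal caller-side sketch of what that could look like under such an ordering guarantee (the syntax() signature is the hypothetical one sketched above; what_is_paragraph(), remember_range() and use_range() are made-up placeholders):

typedef struct {
    unsigned para_seen_in_syntax;   /* paragraphs reported by syntax() so far */
    unsigned para_seen_in_render;   /* paragraphs reported by enter_block() so far */
    /* plus e.g. a growable array mapping paragraph ID -> source offset range */
} AppData;

static int my_syntax(MD_SIZE off, MD_SIZE size, unsigned what, void* detail, void* userdata)
{
    AppData* app = (AppData*) userdata;
    if(what_is_paragraph(what))                      /* hypothetical test of the 'what' bits */
        remember_range(app, app->para_seen_in_syntax++, off, size);
    return 0;
}

static int my_enter_block(MD_BLOCKTYPE type, void* detail, void* userdata)
{
    AppData* app = (AppData*) userdata;
    if(type == MD_BLOCK_P)
        use_range(app, app->para_seen_in_render++);  /* the counter value is the paragraph ID */
    return 0;
}

Because both callbacks would see the paragraphs in document order, the same counter value identifies the same paragraph in both.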

@niblo
Contributor Author

niblo commented Dec 8, 2019

The parser has no knowledge of how much text the renderer will output, given the output format's syntax decoration (e.g. HTML tags) or escaping rules.

That I understand, I think. What about "for a given rendered block/inline, which are its offsets in the source text"? The offsets in the output are not important, only that each rendered element has offset references to its own source.

@mity
Owner

mity commented Dec 8, 2019

IDK.

I can theoretically imagine that maybe all those detail structures would have some source_off member(s).

But:

  1. It would be quite a lot of work to remember it at detection time and make it available at rendering time. We would also have to introduce a bunch of new detail structures in the public API for all the "trivial" elements that haven't needed any so far.

  2. For some block/inline types it would be quite a lot of data. Consider e.g. links or images. Some application may want to know where the brackets marking the link are, while another application might want access to the position of the link destination or title in the source (which may live completely elsewhere, in a link reference definition).

    Or e.g. fenced code blocks with info strings, which may have a ton of offsets: where the opening fence starts and ends, where the info string starts and ends, where the contents start and end, and where the closing fence starts and ends.

  3. I am afraid of the added memory footprint and performance costs for applications which do not need it. (Which is quite likely the vast majority.)

@niblo
Contributor Author

niblo commented Dec 8, 2019

Yes, you are of course right on those points.

I think doing work on the syntax callback feature can provide more insight into this "source offset".

Do you think it would be interesting for only certain types, like tasks? Maybe a compile-time flag could toggle this feature, to avoid the extra overhead.

@mity
Owner

mity commented Dec 8, 2019

For tasks, indeed, it is extra important, as the app may want to change the task's status and so needs to know where the check character lives in the input. But that one already has it (see task_mark_offset), exactly for that purpose.
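For reference, a minimal sketch of reading it in enter_block() (MD_BLOCK_LI_DETAIL and task_mark_offset are the existing API; app_register_checkbox() is just a made-up placeholder):

static int my_enter_block(MD_BLOCKTYPE type, void* detail, void* userdata)
{
    if(type == MD_BLOCK_LI) {
        MD_BLOCK_LI_DETAIL* li = (MD_BLOCK_LI_DETAIL*) detail;
        if(li->is_task) {
            /* li->task_mark is ' ', 'x' or 'X'; li->task_mark_offset is its offset
             * in the input, so the app can rewrite that character in place when
             * the user toggles the checkbox. */
            app_register_checkbox(userdata, li->task_mark_offset, li->task_mark);
        }
    }
    return 0;
}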

@niblo
Contributor Author

niblo commented Dec 8, 2019

That was a nice find. I guess I haven't looked into the tasks feature enough. Thank you for mentioning it. That satisfies my particular needs for now with regards to source offsets.

Then I think it would also be appropriate to add syntax() support for tasks. I have a need for this, so I will try and come up with something useful.

@mity
Owner

mity commented Dec 8, 2019

I think doing work on the syntax callback feature can provide more insight into this "source offset".

Well, I hope that if we add the syntax() callback in the right way, we can avoid the "source offset" in almost all cases. The only exceptions would likely be special cases with a much bigger need for it than e.g. syntax coloring. Like the interactivity support in the case of the task lists right now.

Maybe a compile-time flag could toggle this feature, to avoid the extra overhead.

No. It seems we are being added to more and more Linux distros, and a compile-time option would effectively mean that it cannot be turned on/off per app anyway.

But I believe that if all the bigger data is provided via the syntax callback, run-time branching (if(parser->syntax != NULL)) for all the hard work is good enough. Because parser->syntax remains constant during the whole document processing, the CPU branch predictor and speculative execution should be our friends even on the lamest CPU models.
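I.e., inside the parser, something along these lines (just a sketch of the idea, not actual md4c code; MD_SYNTAX() is a hypothetical helper macro):

/* Hypothetical internal helper: fire the syntax callback only if the app set it. */
#define MD_SYNTAX(off, size, what)                                                   \
    do {                                                                             \
        if(ctx->parser.syntax != NULL) {                                             \
            if(ctx->parser.syntax((off), (size), (what), NULL, ctx->userdata) != 0)  \
                goto abort;                                                          \
        }                                                                            \
    } while(0)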

@niblo
Contributor Author

niblo commented Dec 9, 2019

Ok, then we aim for your idea about syncing the callback order.

@karstenBriksoft
Contributor

I think it's quite problematic to separate the SAX functions and the syntax information. Instead, the syntax information should be part of the SAX functions. I've implemented a proposal in karstenBriksoft@ec1d181 but the critical part is still missing: the correct offset. For the changes to work the context has a new off member but it's currently always 0.
The reason for the new member is that the macros which call the parser's callback functions then only need small changes: adding ctx->off as the last parameter.
The basic idea of the change is to introduce a new MD_PARSER_EX structure and a corresponding function whose callbacks each take one additional parameter: the offset in the string where the parser is when this was detected. So for blocks and spans that's either the beginning or the end, on enter or leave respectively.
I've kept the original MD_PARSER structure and created a fallback implementation that converts an MD_PARSER to an MD_PARSER_EX so that there aren't two separate implementations of the whole thing.

My idea for an initial implementation was to only use MD_OFFSET as the offset, so the offsets passed to the callback would be byte offsets in the source string. Providing line/column information would require additional computation to map line/column to offsets and back. Maybe this can be solved by introducing a new callback like (*line_break)(userdata, off). That way the parser's idea of where line breaks are can be kept in sync with the calling application.
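To give the rough shape of what is described above (this paraphrases the idea, not the actual commit; the only change against MD_PARSER is the extra trailing offset argument):

typedef struct MD_PARSER_EX {
    unsigned abi_version;
    unsigned flags;

    int (*enter_block)(MD_BLOCKTYPE type, void* detail, void* userdata, MD_OFFSET off);
    int (*leave_block)(MD_BLOCKTYPE type, void* detail, void* userdata, MD_OFFSET off);
    int (*enter_span)(MD_SPANTYPE type, void* detail, void* userdata, MD_OFFSET off);
    int (*leave_span)(MD_SPANTYPE type, void* detail, void* userdata, MD_OFFSET off);
    int (*text)(MD_TEXTTYPE type, const MD_CHAR* text, MD_SIZE size, void* userdata, MD_OFFSET off);

    void (*debug_log)(const char* msg, void* userdata);
    void (*syntax)(void);
} MD_PARSER_EX;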

@mity
Owner

mity commented Aug 20, 2020

I think it's quite problematic to separate the SAX functions and the syntax information.

At first glance, I would tend to agree. Actually, it was my original idea too to provide the syntax info in the SAX callbacks. But after some more thinking, it would also bring some hard-to-solve or maybe even impossible-to-solve problems, for the implementation as well as for the interface itself.

Most of them come from the fact that the applications most likely interested in it are text editors, and that those would want much richer information about the syntax used, e.g. the exact position of every bracket-like or quote-like character used to encode an inline link (example: [link label](<http://example.com> 'link title')), in order to highlight them.

So the following is a (possibly incomplete) list of potential problems with such approach:

  1. The SAX callbacks are called in the right order, as the various syntax constructions appear in the source. Yes, this sounds like an advantage for a syntax callback too; that is, from the consumer's POV. But consider that it's not necessarily the order in which the stuff (especially the inline stuff) is recognized. We generally have (and always will have) the full information about the syntax in the function where it is actually being recognized. So either we would have to expand (by order(s) of magnitude) what we store in MD_CTX::block_bytes[] and MD_CTX::marks[] to propagate the info to a much later stage when we fire the SAX callback, or the syntax info would be extremely limited indeed.

  2. The transition from syntax construction recognition to a SAX-like callback almost by definition loses some information. Consider for example links, which can be encoded as inline links, reference links, auto-links, or permissive autolinks. Some of those are quite variable too, so for example the link destination of an inline link may or may not be enclosed in < and >.

    As a result, MD_SPAN_A_DETAIL would have to be expanded into a monster structure with a union to accommodate all of those (and any possible future expansion).

    Other examples are setext versus ATX headers, indented versus fenced code blocks, and possibly many others.

    An application converting Markdown to something else usually does not care how the link or the header is encoded. In contrast, a text editor with a syntax highlighting feature may want to know the position of every special character used to encode that syntax construction.

  3. There would still be stuff which is in a way out of the right order, namely the link reference definitions. There is no SAX-like callback for them. They are only used silently when referenced by some link, yet for any application interested in the source information, they are important.

  4. By using the SAX callbacks, we would close the gate to the possibility of reporting things like: "Hey, this thing looks like a link reference, but it actually is not because there is no link reference definition with the matching label." I argue that text editors might be heavily interested in that to highlight some possible error or something.

  5. Consider also particularly nasty things coming from the fact that inline syntax constructions may be broken into multiple lines and those lines may also contain some enclosing block decorations:

    >>>> [This is a link broken
    >>>> into multiple
    >>>> lines and nested in
    >>>> a (deeply nested chain of) 
    >>>> blockquote(s)](http://example.com
    >>>> 'title')
    

    The consumer must get some sensible information about both what to highlight as a block decoration and what to highlight as a syntax encoding a link.

  6. Other similarly nasty troubles might come e.g. from tables: their syntax is also scattered over multiple lines.

@karstenBriksoft
Contributor

Most of them come from the fact that the applications most likely interested in it are text editors, and that those would want much richer information about the syntax used, e.g. the exact position of every bracket-like or quote-like character used to encode an inline link (example: [link label](<http://example.com> 'link title')), in order to highlight them.

My thinking was more along the lines of: md4c is built for converting Markdown to something else. If there's some rough syntax information provided, that's at least a good starting point.
If, in your link example, the position of the first [ and the position of the last ) are communicated via block_enter and block_leave, then the editor knows that in that text span it'll find a link. It can then use pattern matching or similar techniques to find out the specifics (the link-specific data is already provided by md4c, which makes the search even easier).

  1. The SAX callbacks are called in the right order, as the various syntax constructions appear in the source. Yes, this sounds like an advantage for a syntax callback too; that is, from the consumer's POV. But consider that it's not necessarily the order in which the stuff (especially the inline stuff) is recognized. We generally have (and always will have) the full information about the syntax in the function where it is actually being recognized. So either we would have to expand (by order(s) of magnitude) what we store in MD_CTX::block_bytes[] and MD_CTX::marks[] to propagate the info to a much later stage when we fire the SAX callback, or the syntax info would be extremely limited indeed.

That was my impression, too. That's why the callbacks I defined don't yet provide the correct offsets. As for expanding the structures, I'd start with just the beginning and end offsets.

  2. The transition from syntax construction recognition to a SAX-like callback almost by definition loses some information. Consider for example links, which can be encoded as inline links, reference links, auto-links, or permissive autolinks. Some of those are quite variable too, so for example the link destination of an inline link may or may not be enclosed in < and >.
    As a result, MD_SPAN_A_DETAIL would have to be expanded into a monster structure with a union to accommodate all of those (and any possible future expansion).
    Other examples are setext versus ATX headers, indented versus fenced code blocks, and possibly many others.
    An application converting Markdown to something else usually does not care how the link or the header is encoded. In contrast, a text editor with a syntax highlighting feature may want to know the position of every special character used to encode that syntax construction.

As the possibilities in Markdown are close to endless, I'd not even try to provide the position of every character. It makes the data structures complicated to define, it makes them complicated to fill and on the other end it also makes them complicated to read and understand. The editor would need to be aware of every possibility, maybe without even wanting to support all of them.

  3. There would still be stuff which is in a way out of the right order, namely the link reference definitions. There is no SAX-like callback for them. They are only used silently when referenced by some link, yet for any application interested in the source information, they are important.

That's where I actually see use for the syntax() callback that you defined in the parser: make the syntax() callback an alternative parsing mode, where the caller can be informed "on the fly" about what md4c thinks might be there, with no guarantee that it's actually there, because the data may be invalidated later.

  5. Consider also particularly nasty things coming from the fact that inline syntax constructions may be broken into multiple lines and those lines may also contain some enclosing block decorations:

    >>>> [This is a link broken
    >>>> into multiple
    >>>> lines and nested in
    >>>> a (deeply nested chain of) 
    >>>> blockquote(s)](http://example.com
    >>>> 'title')
    

    The consumer must get some sensible information about both what to highlight as a block decoration and what to highlight as a syntax encoding a link.

According to my idea about only providing start/end offsets, I'd say position 0 starts with the first >, opening a block-quote, then at position 5 the [ marks the beginning of the link. Then the last position at ) marks both the end of the link and the end of the block-quote.

  6. Other similarly nasty troubles might come e.g. from tables: their syntax is also scattered over multiple lines.

Also here, I'd keep it simple and only provide the start and end offsets (I haven't actually tried what md4c provides as callbacks inside tables, but I'd say the strings are at least reported separately, so their offsets would also be reported, which should give you a pretty good idea about where the table is located and what it consists of).

@mity
Owner

mity commented Aug 20, 2020

@karstenBriksoft Could you maybe briefly explain what your use case is? It likely is not syntax highlighting, or is it?

@karstenBriksoft
Contributor

It is syntax highlighting, but on a very basic level. Consider an editor like iA Writer: you see the Markdown, but the Markdown has styles applied so that you get an idea of what you're typing.
So if you type a heading line, the whole line gets bigger; if you type a link, the whole link changes its color. For applying basic styles to Markdown source code there's no need for super fine-grained syntax information. I honestly don't see the benefits of having this detailed information other than extracting the payload information from it. But the payload was already extracted by md4c and provided via callback.

@mity
Owner

mity commented Aug 20, 2020

I honestly don't see the benefits of having this detailed information other than extracting the payload information from it.

I understand your approach may be enough for simple syntax highlighting such as yours.

But at the same time I still believe that providing a way to highlight > in one color when used to denote a blockquote, to highlight all the bracket-like things of the link syntax in another color, and to underline anything used as a URL (links, images, reference definitions) are all very valid and natural features the syntax API should provide for more demanding consumers.

So, give me some time to think it over some more and see whether the SAX-like approach could/should be expanded to accommodate both approaches at the same time, or whether it would be a road to maintenance hell.

@karstenBriksoft
Contributor

But at the same time I still believe that providing a way to highlight > in one color when used to denote a blockquote, to highlight all the bracket-like things of the link syntax in another color, and to underline anything used as a URL (links, images, reference definitions) are all very valid and natural features the syntax API should provide for more demanding consumers.

Like I said: if you want to highlight code like [link label](<http://example.com> 'link title') it's super easy if you know the start and end, because you also know the title and the href, as they're both part of the MD_SPAN_A_DETAIL information. So instead of having to provide information about the control characters [..](<...> '..'), you already provide the inverse information ..link label...http://example.com...link title..., which is perfectly fine and more than enough to create excellent highlighting.
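(For reference, a minimal sketch of reading that detail in the existing API; MD_SPAN_A_DETAIL and its href/title members are real, the rest is just placeholder:)

static int my_enter_span(MD_SPANTYPE type, void* detail, void* userdata)
{
    if(type == MD_SPAN_A) {
        MD_SPAN_A_DETAIL* a = (MD_SPAN_A_DETAIL*) detail;
        /* a->href.text / a->href.size and a->title.text / a->title.size hold
         * the destination and the title. Note these are MD_ATTRIBUTEs, so they
         * carry no offsets into the source document. */
    }
    return 0;
}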

@mity
Owner

mity commented Aug 20, 2020

Like I said: if you want to highlight code like [link label](<http://example.com> 'link title') it's super easy....

No, it is not. You cannot easily distinguish these from what you get in the callback, for example:

<https://example.com>

versus

[https://example.com](https://example.com)

versus

[https://example.com](<https://example.com>)

Or

[foo](https://example.com)

versus

[foo]: https://example.com

[foo]

Or you cannot automatically treat > as a special link character just because it is between a link beginning and end, because it may be a decoration of a blockquote the link is nested in.

And last but not least, you force the application to understand the Markdown specification and reimplement what the parser does, with all the maintenance burden it may bring if e.g. a new link type is added tomorrow.

@niblo
Contributor Author

niblo commented Aug 20, 2020

... I've implemented a proposal in karstenBriksoft@ec1d181 but the critical part is still missing: the correct offset. For the changes to work the context has a new off member but it's currently always 0.

But this is not a proposal at all. Why don't you make an attempt at implementing it instead. Only then will we start to see where the real issues are. There are so many unknowns about how this will work in practice.

@karstenBriksoft
Contributor

@niblo

... I've implemented a proposal in karstenBriksoft@ec1d181 but the critical part is still missing: the correct offset. For the changes to work the context has a new off member but it's currently always 0.

But this is not a proposal at all. Why don't you make an attempt at implementing it instead. Only then will we start to see where the real issues are. There are so many unknowns about how this will work in practice.

It's a proposal of how an alternative API could look. md4c is not exactly super easy to understand in an hour or so, which makes it hard to "quickly" add some offset information to the parser. The implementation is highly optimised for speed, not for comprehension.

@mity
Owner

mity commented Aug 20, 2020

It is about better syntax highlighting than the one you clearly have in mind, and it should offer the following features:

  • Allow highlighting of the special characters used to denote even a complex syntax construction (e.g. all the meaningful | in tables and the special header line underline), or the meaningful brackets, parentheses etc. in links, > in block quotes etc., all of that if and only if they form part of that syntax construction and are not part of the normal text flow.

  • Allow treating all URLs as URLs (e.g. to underline them and make them clickable) without forcing applications to know where URLs may or may not appear in the Markdown syntax or how any of those syntax constructions may or may not be encoded.

    Generally, if a new link type is added into the specification tomorrow and MD4C gets updated to support it, the application should not be required to make any code change in order to highlight the URL in the new link type.

  • Allow syntax highlighting even of stuff which becomes invisible in normal rendering output or when converting to HTML. (Now mainly the link reference definitions. But people often call for other out-of-place features to be added to Markdown, like footnotes.)

  • Do not force the application to reimplement a Markdown parser, or any part of one, if it wants to achieve any of these.

@karstenBriksoft
Contributor

Like I said: if you want to highlight code like [link label](<http://example.com> 'link title') it's super easy....

No, it is not. You cannot easily distinguish these from what you get in the callback, for example:

<https://example.com>

versus

[https://example.com](https://example.com)

versus

[https://example.com](<https://example.com>)

Or

[foo](https://example.com)

versus

[foo]: https://example.com

[foo]

Or you cannot automatically treat > as a special link character just because it is between a link beginning and end, because it may be a decoration of a blockquote the link is nested in.

From what I understand, the parser sends the callbacks in the correct order. So if a link is part of a block quote, you'll get the information about the block quote first. That allows you to apply styling to the block quote, given you know its location. Then you get the information about the anchor span, allowing you to apply the appropriate format in its span. Lastly you get the information about text inside the anchor, allowing you to apply the formatting for the text in an anchor in a block quote.

And last but not least, you force the application to understand the Markdown specification and reimplement what the parser does, with all the maintenance burden it may bring if e.g. a new link type is added tomorrow.

That's why I wouldn't want to provide the source information in a highly specific way, because then it needs adaptation with new rules. If you only provide start/end, that's more future-proof.

@mity
Owner

mity commented Aug 20, 2020

From what I understand, the parser sends the callbacks in the correct order. So if a link is part of a block quote, you'll get the information about the block quote first. That allows you to apply styling to the block quote, given you know its location. Then you get the information about the anchor span, allowing you to apply the appropriate format in its span. Lastly you get the information about text inside the anchor, allowing you to apply the formatting for the text in an anchor in a block quote.

If you get information about where the block quote begins and ends, you have no idea which > characters inside that range come from the blockquote decoration, which are part of normal text content, or which have some other special meaning, like in a link or autolink syntax construction in a paragraph inside all of that mess.

EDIT: There may be an escaped > inside a link body nested in a blockquote, nested in an ordered list, all of that nested in two other block quotes.

EDIT 2: And there are proposals/demand for new container blocks, like new table syntaxes which would allow accommodating multi-line text in a cell. So consider a block quote inside something like that. The API has to be extensible for such things in the future. Sources: 1, 2, 3, and some discussions at https://talk.commonmark.org/, but that seems inaccessible right now.

@niblo
Contributor Author

niblo commented Aug 20, 2020

It's a proposal of how an alternative API could look. md4c is not exactly super easy to understand in an hour or so, which makes it hard to "quickly" add some offset information to the parser. The implementation is highly optimised for speed, not for comprehension.

And that's why there is not much to gain from just proposing an API. The implementation is the issue. The risk is that we discuss this to death. A partial implementation would be very valuable. Even a failed attempt will give much insight.

@karstenBriksoft
Contributor

I think I'm starting to get where you're coming from.

If I take the quoted code below from your earlier example:

>>>> [This is a link broken
>>>> into multiple
>>>> lines and nested in
>>>> a (deeply nested chain of) 
>>>> blockquote(s)](http://example.com
>>>> 'title')

the anchor is split over multiple lines, likewise the blockquote is split, and it's all inside a codeblock that's again part of a blockquote.
Only providing start and end offsets on a block level is insufficient; you are absolutely right about that. But I think what could help would be a list of ranges:

  • The outer blockquote has 8 lines, begins at the first column of each line and goes until the end. So its ranges would be reported as 8 ranges.
  • The nested codeblock is inset by 5 characters, so its 8 lines would be reported as such.
  • The blockquote inside the codeblock only has 6 lines but also starts at column 5.
  • The anchor then has the ranges that start at column 10, but also 6 lines like the block quote.

That means that enter callbacks probably wouldn't need any offset information at all, but leave callbacks would then provide a list of start- and end-offset tuples.
That's still generic enough to apply to all kinds of things and be future-proof, but it's precise enough to not create too much ambiguity.

@rexikan

rexikan commented Oct 6, 2020

My use case for source mapping is a Markdown editor that shows some markers (like for emphasis) and hides others (like for tables). We use offsets in the original Markdown as positions, and we need them for all parsed content, including markers.

As the SAX interface already has callbacks indicating where things start and end, it would be natural to extend it with a marker callback that reports a marker type including the marker text and offsets:

  • It should be possible to exactly recreate the markdown from the events by concatenating the text of markers, attributes, and text in the order the events arrive. This is not only good for testing, but it is also what a pure markdown syntax highlighter would need, to get the type of all the ranges of the text.
  • It should be possible to associate a marker with its block or span. For example, multiline nested blockquote would report markers for the blockquotes interleaved.
  • It would be nice if blocks and inlines also reported offsets, but it would not be strictly necessary as one could keep track of the current offset by looking at the text, attributes, and markers.

This kind of interface has several good features:

  • It is a natural extension of the SAX interface.
  • It would support any markup type without requiring complicated structures for blocks and inlines.
  • It is future proof.
  • It has all the details, so it should be good enough for most applications.
  • It is easy to just not provide a callback for markers if you are not interested in them.

It could be useful to know the type of a marker; for example, for a link title it could be useful to know it is the opening quote of the title. On the other hand, it would quickly get quite complicated, and in most cases it would not be needed anyway. And it is still possible to look at the content and order of the markers within a block to make a good enough guess in the application.
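To make it concrete, a purely hypothetical shape for such a callback (none of these names exist in md4c; it is just to illustrate the idea):

/* Hypothetical marker callback -- for illustration only. */
int marker(MD_MARKERTYPE type,     /* e.g. emphasis mark, blockquote mark, fence, ... (hypothetical enum) */
           const MD_CHAR* text,    /* the marker text exactly as it appears in the source */
           MD_SIZE size,
           MD_OFFSET off,          /* offset of the marker in the source */
           void* userdata);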

@mity
Owner

mity commented Oct 6, 2020

@rexikan

As the SAX interface already has callbacks indicating where things start and end.

I'm not sure what exactly you mean here: as of now, no offsets referring to the source are propagated into the callbacks.

Probably the only exception is the offset of the mark in a task list (MD_BLOCK_LI_DETAIL::task_mark_offset), as it is likely the only feature which can be interactive in some applications. It's there to allow the application to change the task status (toggling the checkbox in the user interface can lead to rewriting the space with an X or vice versa).

The offsets in the structure MD_ATTRIBUTE are only relative to the given text start. I.e. MD_ATTRIBUTE::substr_offsets[0] is always zero, no matter where the given string is located in the source.

MD_ATTRIBUTE::text can even be text pre-processed in a helper temporary buffer instead of using some part of the input source: for example, with the extension MD_FLAG_PERMISSIVEWWWAUTOLINKS or MD_FLAG_PERMISSIVEEMAILAUTOLINKS the address is composed in an auxiliary buffer (the prefix http:// or mailto: needs to be added) and only the result as a whole is propagated into the callback as one attribute.

And I cannot guarantee that some future Markdown feature won't need to do something similar, even in the most important/common situations.
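(For context on the two paragraphs above: iterating the substrings of an MD_ATTRIBUTE looks roughly like this, and all these offsets are relative to attr->text, not to the source document:)

void walk_attribute(const MD_ATTRIBUTE* attr)
{
    int i;
    for(i = 0; attr->substr_offsets[i] < attr->size; i++) {
        MD_TEXTTYPE type = attr->substr_types[i];
        MD_OFFSET beg = attr->substr_offsets[i];
        MD_OFFSET end = attr->substr_offsets[i + 1];
        /* attr->text[beg .. end) is one substring of the attribute with the
         * given text type (normal text, entity, null char, ...). */
        (void) type; (void) beg; (void) end;
    }
}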

It should be possible to exactly recreate the markdown from the events by concatenating the text of markers, attributes, and text in the order the events arrive.

The parsing works (and has to work) in two passes over the input (that's btw a reason why there is no Markdown parser which can work in a streaming fashion): the first pass is responsible for the block analysis (and also collects all link reference definitions), and then the inline analysis runs over every block.

In MD4C, a lot of stuff gathered and analyzed during the block analysis is simply forgotten if it's not strictly needed later, so a lot of work would be needed to keep it around so it could be fired into the application in the right order. The callbacks (even for the enclosing blocks) are only called later during the inline pass. Only very minimal information is currently propagated to the 2nd pass.

The output of the block analysis is just a list of blocks in a very condensed representation, and for each block there is more or less only a vector of its "lines", where the line struct holds only the beginning and end offsets in the source input, so that its contents are stripped of any indentation or block decorations.

The inline analyzer simply processes the stuff inside those (stripped) lines, so all the stuff in the gaps between them is ignored: after all, the gaps contain block-encoding stuff the inline analyzer does not understand, and it would confuse its parsing.

It could be relatively easy to change MD4C so that it calls a callback passing the gap contents in the right moments (between processing one line and subsequent line), but it could provide no additional information what the characters in it mean or how it is related to the stack of nested blocks the currently processed paragraph lives in.

Changing it so that some richer data is passed to the 2nd pass would be a lot of work, and imho quite bad from the maintenance point of view: suddenly the 2nd pass would have to be aware of all that info and understand it at least to some degree, while now the two passes are very independent and the inline analyzer has no need to even know whether it is in a top-level block or inside a block quote or inside a list or a table cell, or some devilish combination of all of those in some nesting:

1. * > - * > > > > * | foo | bar | baz |
                     | --- | --- | --- |
                     | hello | from | a table |

The "lost information" in the gaps between the lines or the blocks includes for example:

  • Any info about link reference definitions: where they are defined and how exactly they are defined. We only have a dictionary of the definitions, so we can query it by the labels when resolving what's a link and what is not. But it's not usable for firing any events at the right time.
  • Any info about blank lines.
  • Any info about ends of lines (is it \n or \r\n for example?).
  • Any info about indentation, list item marks, or block quote marks, like their offsets or how exactly they look.
  • Any info about ATX header prefix and optional suffixes.
  • Any info about Setext header underlines.
  • And there's likely more.

It should be possible to associate a marker with its block or span. For example, multiline nested blockquote would report markers for the blockquotes interleaved.

I'm lost here: if you're nested in multiple levels of blockquotes, and your callback gets called saying "Hey, I've encountered a marker which encodes a blockquote", how exactly do you determine which of the nested blockquotes it is really about? We currently have no unique block identifiers or anything similar: the application implementing the callbacks just maintains the stack of started (and not yet finished) blocks on its own.

@rexikan

rexikan commented Oct 7, 2020

As the SAX interface already has callbacks indicating where things start and end.

I'm not sure what exactly you mean here: as of now, no offsets referring to the source are propagated into the callbacks.

Yes, it was poorly formulated. I meant to say that the SAX interface reports boundaries of syntax elements in the order of the text. Adding offsets to events would give the exact location. From that perspective, it is a natural extension to have callbacks for markers as well.

The benefit I am trying to communicate is that this is a more flexible way to report markers than trying to come up with a data structure on the existing events that will cover all the cases of possible marker locations.

MD_ATTRIBUTE::text can even be text pre-processed in a helper temporary buffer instead of using some part of the input source: for example, with the extension MD_FLAG_PERMISSIVEWWWAUTOLINKS or MD_FLAG_PERMISSIVEEMAILAUTOLINKS the address is composed in an auxiliary buffer (the prefix http:// or mailto: needs to be added) and only the result as a whole is propagated into the callback as one attribute.

In any case, it seems good to provide accessors for attributes as they might be pre-processed as you say. If one wants to be able to recreate markdown exactly from the SAX events, then the attributes must also be reported as they appeared in the markdown.

It could be relatively easy to change MD4C so that it calls a callback passing the gap contents in the right moments (between processing one line and subsequent line), but it could provide no additional information what the characters in it mean or how it is related to the stack of nested blocks the currently processed paragraph lives in.

It might be good enough. Markdown is so messy that it is hard to see how one could do something that is really exact. For inlines at least, one should be able in the app to associate markers with the inline at the top of the stack. For block elements, it would be trickier. If one could at least get the marker type (like list_item_marker_type), it would be a great help.

It should be possible to associate a marker with its block or span. For example, multiline nested blockquote would report markers for the blockquotes interleaved.

I'm lost here: if you're nested in multiple levels of blockquotes, and your callback gets called saying "Hey, I've encountered a marker which encodes a blockquote", how exactly do you determine which of the nested blockquotes it is really about? We currently have no unique block identifiers or anything similar: the application implementing the callbacks just maintains the stack of started (and not yet finished) blocks on its own.

I assume that the markers would arrive in the same order as the stack (from bottom to top), at least for blockquotes and lists.

I know that this is an incredibly tricky problem, and there will be tradeoffs. Even getting offsets for the current events would go a long way for many applications.

@jokteur

jokteur commented Mar 9, 2023

I stumbled onto this issue and had the exact same problem. I first tried to modify md4c for my needs, but with very little success. I ended up rewriting a complete parser (with some flavored Markdown). The idea is briefly presented here.

My solution for the problem

Let's say that we have the following Markdown example:

- >> [abc
  >> def](example.com)

This example would generate an abstract syntax tree (AST) like:

DOC
  UL
    LI
      QUOTE
        QUOTE
          P
            URL

How do we attribute each non-text marker (like -, >, [, ...) to the correct block / span?

I created a parser to solve this specific problem, while keeping reasonable performance. To do this, each object (BLOCK or SPAN) is represented by an array of boundaries. A boundary is defined as follows:

struct Boundary {
    int line_number;
    int pre;
    int beg;
    int end;
    int post;
};

This struct designates offsets in the raw text which form its structure. line_number is the line number in the raw text on which the boundary is currently operating. Offsets between pre and beg are the pre-delimiters, and offsets between end and post are the post-delimiters. Everything between beg and end is the content of the block / span.

Here is a simple example. Suppose we have the text _italic_, which starts at line 0 and offset 0; the boundary struct would then look like {0, 0, 1, 7, 8} (pre = 0, beg = 1, end = 7, post = 8).

Going back to the first example, we now use the following notation to illustrate ownership of markers: an x indicates a delimiter, a _ indicates content, and a . indicates "not in the boundary". Here is the ownership for each block and span:

- >> [abc
  >> def](example.com)

UL:
_________
______________________

LI:
xx_______
xx____________________

QUOTE (1st):
..x______
..x___________________

QUOTE (2nd):
...xx____
...xx_________________

P:
.....____
....._________________

URL:
.....x___
.....___xxxxxxxxxxxxxx

So at each block / span enter, an array of boundaries is provided to the caller. This informs the caller of all the markers used to create a specific block / span.
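For illustration, consuming such a boundary array could look roughly like this (this sketches the approach described here, not md4c's API; color_range() and the color constants are made-up placeholders):

void highlight_object(const struct Boundary* bounds, int count)
{
    int i;
    for(i = 0; i < count; i++) {
        const struct Boundary* b = &bounds[i];
        /* On line b->line_number: [b->pre, b->beg) and [b->end, b->post) are
         * the delimiters owned by this block / span, and [b->beg, b->end) is
         * its content. */
        color_range(b->pre, b->beg, COLOR_DELIMITER);
        color_range(b->end, b->post, COLOR_DELIMITER);
        color_range(b->beg, b->end, COLOR_CONTENT);
    }
}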

In the case of lists and sub-lists, spaces are attributed in the following way:

- a
  - b
    c

UL:
___
_____
_____

LI:
xx_
xx___
xx___

SUB-UL:
....
..___
..___

SUB_LI:
....
..xx_
..xx_

P:
....
...._
...._

I hope this idea will help people looking for a solution to this problem.
