[RFC] Add syntax to include grammar by resetting their base #1276

vmg · 2014-11-12T23:19:53Z

Allan,

Here's a small proposal to fix a (non-critical) issue we found while deploying TextMate grammars to production.

As you obviously know (since you designed the format, haha), include rules in grammars can use the $self and $base magic variables to recursively include themselves or the base grammar at the root of the parse tree. This is a crucial feature for parsing the many programming languages that have some kind of recursion in their syntax.

Grammars like C (source.c), however, routinely include $base instead of $self for their recursion rules. This is because the C grammar is included from other languages (like source.cpp or source.objc) to provide basic syntactic parsing, and when it includes itself recursively, we want the base grammar to be included again (or else chunks of source.cpp would be parsed as source.c, as they would be missing the C++ rules).

The bug, which I believe is not trivial to fix, arises in languages that include a grammar like source.c not to extend their syntax, but to parse a chunk of code as a different language.

Two obvious examples of this are source.lua and source.ruby, which include source.c to highlight an external block of C declarations or a heredoc with C code, respectively.

The result looks like this:

In this case, source.lua includes source.c, and as soon as C does a recursive include (when parsing the inside of a struct definition), $base resolves to Lua, so all the C parsing breaks. We're now parsing Lua inside C inside of Lua. This is not what we want!
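To make the failure concrete, here is a minimal Python sketch of how $self and $base resolve. The GRAMMARS table, resolve, and recursion_target are hypothetical names used only for illustration, not TextMate's parser, and each grammar is reduced to the single include its recursive rule uses.

```python
# Toy model (an assumption, not the real parser): each grammar is reduced
# to the one include string its recursive rule uses.
GRAMMARS = {
    "source.c": ["$base"],       # C recurses through $base
    "source.cpp": ["source.c"],  # C++ extends C
    "source.lua": ["source.c"],  # Lua embeds a block of C
}


def resolve(include, current, base):
    """Map an include string to the name of the grammar it refers to."""
    if include == "$self":
        return current
    if include == "$base":
        return base
    return include


def recursion_target(root, embedded):
    """Return the grammar that `embedded` recurses into when included from `root`."""
    base = root  # $base is the grammar at the root of the parse tree
    current = resolve(embedded, root, base)
    return resolve(GRAMMARS[current][0], current, base)


cpp_case = recursion_target("source.cpp", "source.c")
lua_case = recursion_target("source.lua", "source.c")
```

In the C++ case the recursion lands back in source.cpp, which is what the C grammar intends. In the Lua case it lands in source.lua, which is the bug described here.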

So, how can we work around this? I propose the following small change in syntax: In an include rule, a scope name followed by two hashes (##) resets the base grammar for the inclusion.

Examples:

source.ruby#
Include source.ruby with the current base as base. Equivalent to source.ruby.
source.ruby##
Include source.ruby with source.ruby as base.
source.ruby#regexp
Include the rule regexp from source.ruby, using the current base as base.
source.ruby##regexp
Include the rule regexp from source.ruby, but use source.ruby as base.
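
A rough sketch of how an include string might be split under this proposal, written in Python for illustration. IncludeRef and parse_include are hypothetical names, and the actual patch may well parse the string differently.

```python
from typing import NamedTuple, Optional


class IncludeRef(NamedTuple):
    scope: str            # e.g. "source.ruby"
    rule: Optional[str]   # repository rule name, or None for the whole grammar
    reset_base: bool      # True when the "##" form was used


def parse_include(spec: str) -> IncludeRef:
    """Split an include string into scope, rule, and the reset-base flag."""
    if "##" in spec:
        scope, _, rule = spec.partition("##")
        reset = True
    else:
        scope, _, rule = spec.partition("#")
        reset = False
    # An empty rule name means "include the whole grammar".
    return IncludeRef(scope, rule or None, reset)
```

For instance, parse_include("source.ruby##regexp") yields the scope source.ruby, the rule regexp, and reset_base set to True, while parse_include("source.ruby#") and parse_include("source.ruby") give the same result.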

This change will allow any language that needs to include a sub-language in an isolated way to do so. I believe this is the least intrusive way to fix this issue.

@sorbits: Do you think this is reasonable, or can you come up with a more elegant way to fix the issue? I'm all ears and eager for your feedback. I'd love to get this fixed! :)

Cheers,
vmg

vmg · 2014-11-13T11:21:42Z

@sorbits: I gave this some thought overnight and decided that maybe something like vmg@e0bf32e would be a better idea.

Adding a separate flag means that we can fix grammars, e.g.

      "patterns": [
        {
          "include": "source.c",
          "include_absolute": "1"
        }
      ]

And the new version of TextMate will load these properly, whilst not breaking backwards compatibility with the previous version (or other parsers that don't support this feature).

This is slightly less pretty, but I assume backwards compatibility is a huge deal for you (and rightly so), so I'm leaning towards this approach.
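
As a sketch of why a separate key stays backwards compatible, the Python below models one parser that understands include_absolute and one that ignores it. The function base_for_include, its aware parameter, and the check against the string "1" are assumptions for illustration; only the key name and the example rule come from the snippet above.

```python
import json

# The rule from the example above, as a parser would load it.
RULE = json.loads('{"include": "source.c", "include_absolute": "1"}')


def base_for_include(rule, current_base, aware=True):
    """Return the base grammar used while parsing the included grammar.

    `aware=False` models an older parser that never looks at the new key.
    """
    if aware and rule.get("include_absolute") == "1":
        return rule["include"]  # reset: the included grammar becomes the base
    return current_base         # unchanged behaviour


new_parser = base_for_include(RULE, "source.lua")
old_parser = base_for_include(RULE, "source.lua", aware=False)
```

The older parser still loads the rule and keeps source.lua as the base, so nothing breaks; the newer one resets the base to source.c.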

sorbits · 2014-11-14T09:51:07Z

I think your proposal of adding syntax to the include rule is the best
way to solve this issue.

As for the actual syntax, I propose we treat base as a variable that
we overwrite, for example:

{ include = 'source.c'; base = '$self' };

Alternatively we can set base to source.lua, or even a completely
different grammar.

There have been a few requests for user-specified variables, so the
above thinking would fit nicely into such potential extension, though it
makes me want to wrap the assignment in a variables sub-section so
that we can use the same syntax for every rule, having descendent rules
be affected, and I believe, avoid any special-casing of the include
rule.

So with that syntax, it would look like this:

{   include = 'source.c';
    variables = {
        base = '$self';
    };
};

The general use case for user-specified variables is mainly to define an
identifier regexp to be used in all rules where the language spec allows
an identifier, and where an identifier is somewhat complex.
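
A minimal sketch of that identifier use case in Python, assuming a ${name} placeholder syntax. The expand helper, the placeholder form, and the sample regexp are hypothetical and not part of any existing grammar format.

```python
import re


def expand(pattern, variables):
    """Replace each ${name} placeholder in a rule pattern with its value."""
    # A function is used as the replacement so that backslashes in the
    # variable's value are inserted literally.
    return re.sub(r"\$\{(\w+)\}", lambda m: variables[m.group(1)], pattern)


variables = {"identifier": r"[A-Za-z_]\w*"}
def_rule = expand(r"\bdef\s+(${identifier})", variables)
name = re.search(def_rule, "def foo(x)").group(1)
```

The identifier regexp is written once and reused by every rule that needs it.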

What do you think?

cc: @joachimm

vmg · 2014-11-14T10:41:12Z

@sorbits: Personally, I believe the variables feature should be orthogonal to this. Adding a variables rule next to the include gives us some very fuzzy logic for variable scoping. Ideally, a variables block would only be possible at the top level of a grammar, just like repositories, and apply solely to that grammar.

If you add a variable for an include, things become very complex (and I'd argue they simply break down if you actually recursively apply the replacement to all other rules -- imagine a case where we replace $base on a grammar that also replaces $base).

Note that I really like the idea of variable replacements, but applying them recursively for includes sounds like a recipe for disaster. For this specific use case, we want to reset the base of the include, not replace it with something else. Also, being able to replace the value of $self instead of base sounds scary.

So I would lean towards something like { includeResetBase : 1 }, which also has the benefit of being 100% backwards compatible with all other implementations.

infininight · 2014-11-17T04:13:45Z

So I've been thinking about this for a couple days, seems to me what triggers this issue is when a new context is created. In this case an embedded block of C is created, would it not make sense for the block to reset the base rather than the include? I guess it doesn't make much real difference it just seems that the block is what is creating the new context, that the include needs to honor it is incidental.

vmg · 2014-11-17T09:56:02Z

So I've been thinking about this for a couple days, seems to me what triggers this issue is when a new context is created. In this case an embedded block of C is created, would it not make sense for the block to reset the base rather than the include?

I'm not sure I follow. What do you mean exactly by a context?

infininight · 2014-11-17T10:17:28Z

I mean that it is the begin/end rule that creates a new context which in this case is a block of embedded source.c. i.e.: It is that block that really creates the need for a new 'base' as it should really be treated more as a new document to the parser. (For this purpose at least.)

aroben · 2014-11-17T15:04:37Z

In this case an embedded block of C is created, would it not make sense for the block to reset the base rather than the include?

If that were all that were desired, we could just change the C grammar to include $self instead of $base. But reaching back up to the $base grammar is actually what is desired in the case of C++ and Objective-C.

vmg · 2014-11-17T15:14:26Z

Yes, I believe that resetting the base should be a choice when including a subgrammar, and not really related to the block it's included in.

And going back to @sorbits' suggestion: it's becoming increasingly clear to me that, although variable substitution would be a great thing to implement (and heck -- I personally wouldn't mind writing the patchset myself and sending you a PR, Allan, it sounds like a very useful thing to have), both $base and $self should always be kept as "magical" (special-case) variables, and never be allowed to be replaced, because the implications of selectively replacing either in a block are too many.

Hence, my suggestion for a syntax to clear the base rule for an inclusion, but not to arbitrarily replace it, because that would surely lead to chaos.

sorbits · 2014-11-18T10:35:56Z

For variables, there are two possibilities.

1. The macro-like approach, which it sounds like you have in mind. Each grammar has a variables section at the top level which can be used only within that grammar, and in theory we could expand all variables before using the grammar. This is well-suited for common patterns like identifiers.
2. A more dynamic approach where variables come from the current context, which would inherit variables from parent grammars.

I think the flexibility of dynamic variables has some good use-cases as well.

We could e.g. do a common line comment rule which would be used like this:

{  include = 'source.common#line_comment';
   variables = {
      commentCharacter = '#';
   };
};

Though a better reason for dynamic variables is probably the current C, C++, Objective-C, and Objective-C++ grammars. The C grammar has a rule to match stdlib functions, and the Objective-C grammar has a similar rule for Cocoa functions. The C++ grammar has rules for matching braces to introduce scopes for namespaces and classes, so it needs to include the C grammar’s functions inside these new scopes. The problem is with Objective-C++: it includes the Objective-C and C++ grammars, and the latter includes only C functions in its brace scopes, but it should also include the Objective-C functions when included from Objective-C++.

This could be solved by having the C grammar include $functions instead. The grammar itself sets this variable to #functions, and the Objective-C grammar changes it to source.objc#functions.
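
One way to picture the dynamic lookup is a chain of variable layers, sketched here in Python with collections.ChainMap. The layer names are hypothetical, and the precedence shown (a value set by an enclosing grammar wins over the C grammar's own default) is an assumption; the thread does not pin that down.

```python
from collections import ChainMap

# Default declared by the C grammar itself.
C_DEFAULT = {"functions": "#functions"}
# Value set by Objective-C when it includes C.
OBJC_OVERRIDE = {"functions": "source.objc#functions"}

# ChainMap searches its mappings left to right, so an earlier layer
# shadows a later one.
plain_c = ChainMap(C_DEFAULT)
from_objc = ChainMap(OBJC_OVERRIDE, C_DEFAULT)

plain_target = plain_c["functions"]
objc_target = from_objc["functions"]
```

With no enclosing grammar, $functions falls back to #functions; when Objective-C is in the chain, the same include picks up source.objc#functions.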

As for “fuzzy logic” for variable scopes: I think we can define our way out of that.

So back to $base: We already use variable (format string) syntax when referencing it, so why not use variable declaration syntax to change it? We can simply define something like: “base can only be set in include rules and the only supported value is $self”. That way we only allow the one single thing you have suggested, namely to reset base, but we do it with syntax that allows us to do more in the future, should we decide to do so.
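
The restriction quoted above could be checked along these lines. This is a Python sketch only: base_override is a hypothetical helper, the rule is modelled as a plain dict, and reading $self as "the grammar being included" is an interpretation of the proposal rather than something the thread states outright.

```python
def base_override(rule):
    """Return the new base for an include rule, or None to keep the current base."""
    variables = rule.get("variables", {})
    if "base" not in variables:
        return None
    if "include" not in rule:
        raise ValueError("base can only be set in include rules")
    if variables["base"] != "$self":
        raise ValueError("the only supported value for base is $self")
    # Assumption: $self here refers to the grammar being included.
    return rule["include"]


reset = base_override({"include": "source.c", "variables": {"base": "$self"}})
untouched = base_override({"include": "source.c"})
```

Any other value for base is rejected for now, which leaves room to relax the check later without changing the syntax.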

I think there is value in keeping our options open, and also trying to limit the number of special constructs to a minimum.

As for setting base to something other than $self: I think that might actually be a useful feature. Imagine for example we do a C here-doc in ruby, we would include source.c, but in the here-doc we support #{embedded ruby code}, so we could have a rule in the ruby grammar (embedded_c) which matches embedded code, includes source.c, and sets base to source.ruby#embedded_c.

Ideally though we would use injection to match embedded code, but the example still shows that there could be value in being able to redefine base.

Anyway, for a start I think it’s fine to not allow base to take on other values than $self in an include block, but I would opt for still making it look like a variable, akin to how OO languages have special variables like self, this, and super.

Come to think of it, the shared line comment rule from my first example might use ${base/.*(\..*)/$1/} as suffix in the scopes it creates. So yeah, I really think $base should just be a variable.
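
For readers unfamiliar with format-string transformations, the same regexp substitution can be approximated in Python as follows. The scope_suffix helper and the comment.line prefix are illustrative assumptions, and Python's re module stands in for the regexp engine a grammar parser would use.

```python
import re


def scope_suffix(base):
    """Keep only the last dot-separated component of a scope, with its dot."""
    # count=1 mirrors a substitution without a "global" option.
    return re.sub(r".*(\..*)", r"\1", base, count=1)


ruby_suffix = scope_suffix("source.ruby")
comment_scope = "comment.line" + ruby_suffix
```

Because the leading .* is greedy, the capture group holds the text from the last dot onward, so a multi-part scope such as text.html.basic would yield .basic.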

vmg · 2014-11-18T10:55:51Z

That way we only allow the one single thing you have suggested, namely to reset base, but we do it with syntax that allows us to do more in the future, should we decide to do so.

I like this approach. Let me see what an implementation looks like.

V0idk · 2018-05-30T05:45:11Z

Was this solved? I am using VS Code and have this problem:

Markdown code block syntax highlightning is broken for C and C++ #34525

APerricone · 2018-12-04T06:35:51Z

Is there an example of Objective-C, C++, or Objective-C++ code that does not work without $base?
Why does the C++ syntax use it too?

vmg added 3 commits November 12, 2014 23:51

Revert "Add syntax to include grammar by resetting their base"

49c9580

This reverts commit 6dcfcf2.

Add an optional key include_absolute to reset includes

e0bf32e

This key has the same behavior as the previous implementation, but with the advantage of being backwards compatible.

sekogan mentioned this pull request Nov 9, 2015

Something wrong with the code block display sekogan/MarkdownLight#2

Open

FichteFoll mentioned this pull request Feb 14, 2016

$top_level_main special context name is not documented sublimehq/Packages#73

Closed

randy3k mentioned this pull request Feb 15, 2016

[For the record] LaTeX highlight issues jonschlinkert/sublime-markdown-extended#119

Open

sorbits mentioned this pull request Jun 3, 2016

Use $self instead of $base textmate/c.tmbundle#44

Closed

tomedunn mentioned this pull request Feb 6, 2017

Add submodule snippet dparkins/language-fortran#94

Merged

mjbvz mentioned this pull request Sep 18, 2017

Incorrect highlighting when source.c or source.cpp is embedded within another grammar atom/language-c#250

Closed
