[RFC] Add syntax to include grammar by resetting their base #1276

Open
wants to merge 3 commits into
base: master
from

Conversation


@vmg vmg commented Nov 12, 2014

Allan,

Here's a small proposal to fix a (non-critical) issue we found while deploying TextMate grammars to production.

As you obviously know (since you designed the format, haha), include rules in grammars can use the $self and $base magic variables to recursively include themselves or the base grammar at the root of the parse tree. This is a crucial feature for parsing the many programming languages that have some kind of recursion in their syntax.

Grammars like C (source.c), however, routinely include $base instead of $self for their recursion rules. This is because the C grammar is included from other languages (like source.cpp or source.objc) to provide basic syntactic parsing, and when it includes itself recursively, we want the base grammar to be included again (or else chunks of source.cpp would be parsed as source.c, as they would be missing the C++ rules).

The bug, which I believe is not trivial to fix, arises in languages that include a grammar like source.c not to extend their syntax, but to parse a chunk of code as a different language.

Two obvious examples of this are source.lua and source.ruby, which include source.c to highlight an external block of C declarations or a heredoc with C code, respectively.

The result looks like this:

[screenshot, 2014-11-13 at 12:08 AM]

In this case, when source.lua includes source.c, and as soon as C does a recursive include (when parsing the inside of a struct definition), the $base rule is obviously Lua, so all the C parsing breaks. We're now parsing Lua inside C inside of Lua. This is not what we want!
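The failure described above can be reproduced with a toy model. This is only a sketch of the resolution rule as described in this thread, not TextMate's actual implementation; the `GRAMMARS` table, `resolve`, and `recursion_target` are hypothetical names, and each grammar is reduced to the single include that matters here.

```python
# Toy model: each grammar is reduced to the one include relevant to this bug.
GRAMMARS = {
    "source.c": "$base",       # C recurses through $base (e.g. inside a struct body)
    "source.cpp": "source.c",  # C++ extends C
    "source.lua": "source.c",  # Lua embeds a block of C
}


def resolve(include, current, base):
    """Return the scope name an include target refers to."""
    if include == "$self":
        return current
    if include == "$base":
        return base
    return include


def recursion_target(root):
    """Which grammar does C's recursive include reach when `root` is the document grammar?"""
    base = root
    # The root grammar includes source.c ...
    current = resolve(GRAMMARS[root], root, base)
    # ... and source.c then includes $base.
    return resolve(GRAMMARS[current], current, base)
```

With `source.cpp` as the root, the recursion reaches `source.cpp`, which is the desired extension behaviour. With `source.lua` as the root, the same rule reaches `source.lua`, so Lua rules are applied inside the embedded C block.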

So, how can we work around this? I propose the following small change in syntax: In an include rule, a scope name followed by two hashes (##) resets the base grammar for the inclusion.

Examples:

  • source.ruby#
    Include source.ruby with the current base as base.
    Equivalent to source.ruby.
  • source.ruby##
    Include source.ruby with source.ruby as base.
  • source.ruby#regexp
    Include the rule regexp from source.ruby, using the current
    base as base.
  • source.ruby##regexp
    Include the rule regexp from source.ruby, but use
    source.ruby as base.
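
The four forms listed above could be told apart with a small parser. This is a minimal sketch under the assumption that the scope name itself never contains `#`; `IncludeTarget` and `parse_include` are hypothetical names, not part of TextMate.

```python
from typing import NamedTuple, Optional


class IncludeTarget(NamedTuple):
    scope: str            # grammar scope name, e.g. "source.ruby"
    rule: Optional[str]   # repository rule name, or None for the whole grammar
    reset_base: bool      # True when the proposed "##" form was used


def parse_include(spec: str) -> IncludeTarget:
    """Split an include string into scope, optional rule, and the reset flag."""
    scope, _, rest = spec.partition("#")
    # A second hash directly after the first one requests a base reset.
    reset_base = rest.startswith("#")
    if reset_base:
        rest = rest[1:]
    return IncludeTarget(scope, rest or None, reset_base)
```

Under this reading, `source.ruby` and `source.ruby#` parse to the same target, and only the `##` forms set `reset_base`.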

This change will allow any languages that need to require a sub-language in an isolated way to do so. I believe this is the least intrusive way to fix this issue.

@sorbits: Do you think this is reasonable, or can you come up with a more elegant way to fix the issue? I'm all ears and eager for your feedback. I'd love to get this fixed! :)

Cheers,
vmg

A scope name followed by two hashes (`##`) resets the base grammar for
an inclusion rule.

Examples:

	- `source.ruby#`
		Include `source.ruby` with the current base as base.
		Equivalent to `source.ruby`.

	- `source.ruby##`
		Include `source.ruby` with `source.ruby` as base

	- `source.ruby#regexp`
		Include the rule `regexp` from `source.ruby`, using the current
		base as base.

	- `source.ruby##regexp`
		Include the rule `regexp` from `source.ruby`, but use
		`source.ruby` as base.
This key has the same behavior as the previous implementation, but with
the advantage of being backwards compatible.
@vmg
Author

vmg commented Nov 13, 2014

@sorbits: I gave this some thought overnight and decided that maybe something like vmg@e0bf32e would be a better idea.

Adding a separate flag means that we can fix grammars, e.g.

      "patterns": [
        {
          "include": "source.c",
          "include_absolute": "1"
        }
      ]

And the new version of TextMate will load these properly, whilst not breaking backwards compatibility with the previous version (or other parsers that don't support this feature).
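
To illustrate why a separate key degrades gracefully, here is a rough sketch of two loaders reading the same rule. It assumes that a value of `"1"` means "use the included grammar as base" and that older parsers simply ignore keys they do not know; `effective_base` and `honour_flag` are hypothetical names.

```python
import json

# The rule from the example above.
rule = json.loads('{"include": "source.c", "include_absolute": "1"}')


def effective_base(rule, current_base, honour_flag):
    """Pick the base grammar to use while processing an include rule.

    honour_flag=False models a parser that predates the flag and ignores it.
    """
    if honour_flag and rule.get("include_absolute") == "1":
        return rule["include"]   # reset: the included grammar becomes the base
    return current_base          # unchanged behaviour
```

A parser that ignores the key keeps `source.lua` as base, exactly as before, while one that honours it switches to `source.c`.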

This is slightly less pretty, but I assume backwards compatibility is a huge deal for you (and rightly so), so I'm leaning towards this approach.

@sorbits
Member

sorbits commented Nov 14, 2014

I think your proposal of adding syntax to the include rule is the best
way to solve this issue.

As for the actual syntax, I propose we treat base as a variable that
we overwrite, for example:

{ include = 'source.c'; base = '$self' };

Alternatively we can set base to source.lua, or even a completely
different grammar.

There have been a few requests for user-specified variables, so the
above thinking would fit nicely into such potential extension, though it
makes me want to wrap the assignment in a variables sub-section so
that we can use the same syntax for every rule, having descendent rules
be affected, and I believe, avoid any special-casing of the include
rule.

So with that syntax, it would look like this:

{   include = 'source.c';
    variables = {
        base = '$self';
    };
};

The general use case for user-specified variables is mainly to define an
identifier regexp to be used in all rules where the language spec allows
an identifier, and where an identifier is somewhat complex.
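
As a rough sketch of how a `variables` block on an include could change the base, consider the function below. It assumes that `$self` in this position means the grammar being included (here `source.c`), which is one reading of the proposal; `base_after_include` is a hypothetical name.

```python
from typing import Optional


def base_after_include(target: str, inherited_base: str,
                       variables: Optional[dict] = None) -> str:
    """Return the base grammar in effect inside an included grammar."""
    override = (variables or {}).get("base")
    if override is None:
        return inherited_base   # no variables block: the base is inherited as today
    if override == "$self":
        return target           # assumed meaning: reset base to the included grammar
    return override             # any other scope name is used verbatim
```

With `variables = {"base": "$self"}` an include of `source.c` from Lua would run with `source.c` as base, while omitting the block keeps `source.lua`.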

What do you think?

cc: @joachimm


@vmg
Author

vmg commented Nov 14, 2014

@sorbits: Personally, I believe the variables feature should be orthogonal to this. Adding a variables rule next to the include gives us some very fuzzy logic for variable scoping. Ideally, a variables block would only be possible at the top level of a grammar, just like repositories, and apply solely to that grammar.

If you add a variable for an include, things become very complex (and I'd argue they simply break down if you actually recursively apply the replacement to all other rules -- imagine a case where we replace $base on a grammar that also replaces $base).

Note that I really like the idea of variable replacements, but applying them recursively for includes sounds like a recipe for disaster. For this specific use case, we want to reset the base of the include, not replace it with something else. Also, being able to replace the value of $self instead of base sounds scary.

So I would lean towards something like { includeResetBase : 1 }, which also has the benefit of being 100% backwards compatible with all other implementations.

@infininight
Member

So I've been thinking about this for a couple days, seems to me what triggers this issue is when a new context is created. In this case an embedded block of C is created, would it not make sense for the block to reset the base rather than the include? I guess it doesn't make much real difference it just seems that the block is what is creating the new context, that the include needs to honor it is incidental.

@vmg
Author

vmg commented Nov 17, 2014

So I've been thinking about this for a couple days, seems to me what triggers this issue is when a new context is created. In this case an embedded block of C is created, would it not make sense for the block to reset the base rather than the include?

I'm not sure I follow. What do you mean exactly by a context?

@infininight
Member

I mean that it is the begin/end rule that creates a new context which in this case is a block of embedded source.c. i.e.: It is that block that really creates the need for a new 'base' as it should really be treated more as a new document to the parser. (For this purpose at least.)

@aroben

aroben commented Nov 17, 2014

In this case an embedded block of C is created, would it not make sense for the block to reset the base rather than the include?

If that were all that were desired, we could just change the C grammar to include $self instead of $base. But reaching back up to the $base grammar is actually what is desired in the case of C++ and Objective-C.

@vmg
Author

vmg commented Nov 17, 2014

Yes, I believe that resetting the base should be a choice when including a subgrammar, and not really related to the block it's included in.

And going back to @sorbits' suggestion: it's becoming increasingly clear to me that, although variable substitution would be a great thing to implement (and heck -- I personally wouldn't mind writing the patchset myself and sending you a PR, Allan, it sounds like a very useful thing to have), both $base and $self should always be kept as "magical" (special-case) variables, and never be allowed to be replaced, because the implications of selectively replacing either in a block are too many.

Hence, my suggestion for a syntax to clear the base rule for an inclusion, but not to arbitrarily replace it, because that would surely lead to chaos.

@sorbits
Member

sorbits commented Nov 18, 2014

For variables, there are two possibilities.

  1. The macro-like approach, which it sounds like you have in mind. Each grammar has a variables section at the top level which can be used only within that grammar, and in theory we could expand all variables before using the grammar. This is well suited to common patterns such as identifiers.
  2. A more dynamic approach where variables come from the current context, which would inherit variables from parent grammars.

I think the flexibility of dynamic variables has some good use-cases as well.

We could e.g. do a common line comment rule which would be used like this:

{  include = 'source.common#line_comment';
   variables = {
      commentCharacter = '#';
   };
};
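
A shared rule of this kind could build its pattern from the variable at the point of inclusion. The sketch below is only an illustration: `line_comment_pattern` is a hypothetical helper, and matching from the comment character to the end of the line is an assumption about what such a rule would do.

```python
import re


def line_comment_pattern(variables: dict) -> "re.Pattern[str]":
    """Build a line-comment regexp from the includer's commentCharacter variable."""
    char = variables["commentCharacter"]
    # Escape the character(s) so that e.g. "#" or "--" are matched literally.
    return re.compile(re.escape(char) + r".*$", re.MULTILINE)
```

Each including grammar would supply its own value, for example `"#"` for a shell-like language or `"--"` for a Lua-like one.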

Though a better reason for dynamic variables is probably the current C, C++, Objective-C, and Objective-C++ grammars. The C grammar has a rule to match stdlib functions; the Objective-C grammar has a similar rule for Cocoa functions. The C++ grammar has rules for matching braces to introduce scopes for namespaces and classes, so it needs to include the C grammar’s functions inside these new scopes. The problem is with Objective-C++: it includes the Objective-C and C++ grammars, and the latter includes only C functions in its brace scopes, but it should also include the Objective-C functions when included from Objective-C++.

This could be solved by having the C grammar include $functions instead. The grammar itself sets this variable to #functions, and the Objective-C grammar changes it to source.objc#functions.
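
One way to picture the dynamic lookup is a chain of variable tables, searched from the outermost grammar inwards. This is a sketch under the assumption that an outer grammar's setting overrides the default of a grammar it includes; `C_VARIABLES`, `OBJC_VARIABLES`, and `functions_include` are hypothetical names.

```python
from collections import ChainMap

# Defaults declared by each grammar (values taken from the description above).
C_VARIABLES = {"functions": "#functions"}
OBJC_VARIABLES = {"functions": "source.objc#functions"}


def functions_include(*enclosing: dict) -> str:
    """Resolve $functions given variable tables ordered from outermost to innermost.

    ChainMap searches its first mapping first, so the outermost grammar wins.
    """
    return ChainMap(*enclosing)["functions"]
```

When C is parsed on its own, `$functions` stays `#functions`; when it is reached through Objective-C, the lookup yields `source.objc#functions`.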

As for “fuzzy logic” for variable scopes: I think we can define our way out of that.

So back to $base: We already use variable (format string) syntax when referencing it, so why not use variable declaration syntax to change it? We can simply define something like: “base can only be set in include rules and the only supported value is $self”. That way we only allow the one single thing you have suggested, namely to reset base, but we do it with syntax that allows us to do more in the future, should we decide to do so.
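
The restriction quoted above is easy to enforce at grammar load time. The following is a minimal sketch, assuming the flat `{ include = ...; base = ... }` form from earlier in the thread and rules already parsed into dictionaries; `validate_rule` is a hypothetical name.

```python
def validate_rule(rule: dict) -> None:
    """Reject any use of `base` other than the single supported form."""
    if "base" not in rule:
        return
    if "include" not in rule:
        raise ValueError("base can only be set in include rules")
    if rule["base"] != "$self":
        raise ValueError("the only supported value for base is $self")
```

Lifting the second check later would allow other base values without changing the syntax.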

I think there is value in keeping our options open, and also trying to limit the number of special constructs to a minimum.

As for setting base to something other than $self: I think that might actually be a useful feature. Imagine for example we do a C here-doc in ruby, we would include source.c, but in the here-doc we support #{embedded ruby code}, so we could have a rule in the ruby grammar (embedded_c) which matches embedded code, includes source.c, and sets base to source.ruby#embedded_c.

Ideally though we would use injection to match embedded code, but the example still shows that there could be value in being able to redefine base.

Anyway, for a start I think it’s fine to not allow base to take on other values than $self in an include block, but I would opt for still making it look like a variable, akin to how OO languages have special variables like self, this, and super.

Come to think of it, the shared line comment rule from my first example might use ${base/.*(\..*)/$1/} as suffix in the scopes it creates. So yeah, I really think $base should just be a variable.
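
For readers unfamiliar with format-string transformations, the expression `${base/.*(\..*)/$1/}` keeps only the last dotted component of the base scope. A rough Python equivalent, assuming a single (non-global) replacement, looks like this; `scope_suffix` is a hypothetical name.

```python
import re


def scope_suffix(base: str) -> str:
    """Mimic ${base/.*(\\..*)/$1/}: replace the first match with capture group 1."""
    # The greedy ".*" backtracks to the last dot, so group 1 is the final component.
    return re.sub(r".*(\..*)", r"\1", base, count=1)
```

For a base of `source.ruby` this yields `.ruby`, which a shared comment rule could append to the scope names it creates.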

@vmg
Author

vmg commented Nov 18, 2014

That way we only allow the one single thing you have suggested, namely to reset base, but we do it with syntax that allows us to do more in the future, should we decide to do so.

I like this approach. Let me see what an implementation looks like.

@V0idk

V0idk commented May 30, 2018

Was this solved? I am using VS Code and have this problem:

Markdown code block syntax highlightning is broken for C and C++ #34525

@APerricone

APerricone commented Dec 4, 2018

Is there an example of Objective-C, C++, or Objective-C++ code that does not work without $base?
Why does the C++ syntax use it too?

@sorbits sorbits force-pushed the master branch 2 times, most recently from 093e8eb to d2979e2 Compare September 12, 2019 19:15
@sorbits sorbits force-pushed the master branch 2 times, most recently from e28f51d to 97caab6 Compare May 26, 2021 07:46
6 participants