Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenCypher Antlr - strange parse tree (c#) #515

Open
pmikstacki opened this issue Nov 17, 2021 · 3 comments
Open

OpenCypher Antlr - strange parse tree (c#) #515

pmikstacki opened this issue Nov 17, 2021 · 3 comments

Comments

@pmikstacki
Copy link

Hi!
I have generated cypher lexer and parser using grammar on the openCypher Resources page. However when I enter Expression Context it parses into this mess
antlr

is it supposed to be that way? When I look at railroad diagram it looks like they're not in parent child relationship, they are in sibling-sibling relationship. Or am I getting it wrong?
Best Regards,
Piotr Mikstacki

@hvub
Copy link
Contributor

hvub commented Nov 25, 2021

Hi Piotr,

Thank you for your question and for engaging with the openCypher artifacts.

Unfortunately, I cannot really get a handle on your question. Could be please be more precise? What is precisely the mess? In which respect you find things messy? What is the picture suppose to show? Which parser generator framework are you using Antlr, I guess? Which query are you trying to parse?

Can you try to reduce the problem to a very small instance (very short and simple query), where you start to get "the mess".

Sorry for all the questions, but your post does not provide me enough context to do anything helpful with it.

Cheers
-Hannes

@pmikstacki
Copy link
Author

pmikstacki commented Nov 27, 2021

Hi Hannes,
The exact cypher query is:

MATCH(n:Person) 
WHERE n.name='Adam' AND n.Age = 18
RETURN n

I am starting to get "the mess" in the Expression Evaluation.
I am using the newest Antlr version to generate my parser/lexer.
This is the correct Raiload Diagram from the OpenCypher Website (https://s3.amazonaws.com/artifacts.opencypher.org/M18/railroad/Expression.html)
Raiload Expression
As you can see AND,XOR,OR are "sibling-sibling" relationship
However, it becomes "parent-child" relationship and is also incomplete (it stops where it is supposed to have children - see the intelisense):
Parent-Child
In the generated parser and lexer the type dependency diagram for the expression context looks like this:
dependency
Which means that there is a parent-child relationship.
As an experiment I wrote a recursive method to manually walk the parse tree and write out every lexeme (child context) that it finds. Indents indicate Child-Parent Relationships in a context.

--- Start Parse Tree ---
Type: OC_StatementContext, Value: 
MATCH(n:Person)
WHERE n.name='Adam' AND n.Age = 18
RETURN n
Type: OC_QueryContext, Value: 
MATCH(n:Person)
WHERE n.name='Adam' AND n.Age = 18
RETURN n

Type: OC_RegularQueryContext, Value: 

MATCH(n:Person)
WHERE n.name='Adam' AND n.Age = 18
RETURN n

Type: OC_SingleQueryContext, Value: 

MATCH(n:Person)
WHERE n.name='Adam' AND n.Age = 18
RETURN n

Type: OC_SinglePartQueryContext, Value: 

MATCH(n:Person)
WHERE n.name='Adam' AND n.Age = 18
RETURN n

Type: OC_ReadingClauseContext, Value: MATCH(n:Person)
WHERE n.name='Adam' AND n.Age = 18
Type: OC_MatchContext, Value: MATCH(n:Person)
WHERE n.name='Adam' AND n.Age = 18
Type: TerminalNodeImpl, Value: MATCH
Type: OC_PatternContext, Value: (n:Person)
 Type: OC_PatternPartContext, Value: (n:Person)
 Type: OC_AnonymousPatternPartContext, Value: (n:Person)
 Type: OC_PatternElementContext, Value: (n:Person)
 Type: OC_NodePatternContext, Value: (n:Person)
 Type: TerminalNodeImpl, Value: (
 Type: OC_VariableContext, Value: n
  Type: OC_SymbolicNameContext, Value: n
  Type: TerminalNodeImpl, Value: n
 Type: OC_NodeLabelsContext, Value: :Person
   Type: OC_NodeLabelContext, Value: :Person
   Type: TerminalNodeImpl, Value: :
   Type: OC_LabelNameContext, Value: Person
    Type: OC_SchemaNameContext, Value: Person
    Type: OC_SymbolicNameContext, Value: Person
    Type: TerminalNodeImpl, Value: Person
 Type: TerminalNodeImpl, Value: )
Type: TerminalNodeImpl, Value:

Type: OC_WhereContext, Value: WHERE n.name='Adam' AND n.Age = 18
   Type: TerminalNodeImpl, Value: WHERE
   Type: TerminalNodeImpl, Value:
   Type: OC_ExpressionContext, Value: n.name='Adam' AND n.Age = 18
     Type: OC_OrExpressionContext, Value: n.name='Adam' AND n.Age = 18
     Type: OC_XorExpressionContext, Value: n.name='Adam' AND n.Age = 18
     Type: OC_AndExpressionContext, Value: n.name='Adam' AND n.Age = 18
     Type: OC_NotExpressionContext, Value: n.name='Adam'
     Type: OC_ComparisonExpressionContext, Value: n.name='Adam'
     Type: OC_AddOrSubtractExpressionContext, Value: n.name
     Type: OC_MultiplyDivideModuloExpressionContext, Value: n.name
     Type: OC_PowerOfExpressionContext, Value: n.name
     Type: OC_UnaryAddOrSubtractExpressionContext, Value: n.name
     Type: OC_StringListNullOperatorExpressionContext, Value: n.name
     Type: OC_PropertyOrLabelsExpressionContext, Value: n.name
     Type: OC_AtomContext, Value: n
     Type: OC_VariableContext, Value: n
     Type: OC_SymbolicNameContext, Value: n
     Type: TerminalNodeImpl, Value: n
     Type: OC_PropertyLookupContext, Value: .name
      Type: TerminalNodeImpl, Value: .
      Type: OC_PropertyKeyNameContext, Value: name
       Type: OC_SchemaNameContext, Value: name
       Type: OC_SymbolicNameContext, Value: name
       Type: TerminalNodeImpl, Value: name
     Type: OC_PartialComparisonExpressionContext, Value: ='Adam'
      Type: TerminalNodeImpl, Value: =
      Type: OC_AddOrSubtractExpressionContext, Value: 'Adam'
       Type: OC_MultiplyDivideModuloExpressionContext, Value: 'Adam'
       Type: OC_PowerOfExpressionContext, Value: 'Adam'
       Type: OC_UnaryAddOrSubtractExpressionContext, Value: 'Adam'
       Type: OC_StringListNullOperatorExpressionContext, Value: 'Adam'
       Type: OC_PropertyOrLabelsExpressionContext, Value: 'Adam'
       Type: OC_AtomContext, Value: 'Adam'
       Type: OC_LiteralContext, Value: 'Adam'
       Type: TerminalNodeImpl, Value: 'Adam'
     Type: TerminalNodeImpl, Value:
     Type: TerminalNodeImpl, Value: AND
     Type: TerminalNodeImpl, Value:
     Type: OC_NotExpressionContext, Value: n.Age = 18
         Type: OC_ComparisonExpressionContext, Value: n.Age = 18
         Type: OC_AddOrSubtractExpressionContext, Value: n.Age
         Type: OC_MultiplyDivideModuloExpressionContext, Value: n.Age
         Type: OC_PowerOfExpressionContext, Value: n.Age
         Type: OC_UnaryAddOrSubtractExpressionContext, Value: n.Age
         Type: OC_StringListNullOperatorExpressionContext, Value: n.Age
         Type: OC_PropertyOrLabelsExpressionContext, Value: n.Age
         Type: OC_AtomContext, Value: n
         Type: OC_VariableContext, Value: n
         Type: OC_SymbolicNameContext, Value: n
         Type: TerminalNodeImpl, Value: n
         Type: OC_PropertyLookupContext, Value: .Age
          Type: TerminalNodeImpl, Value: .
          Type: OC_PropertyKeyNameContext, Value: Age
           Type: OC_SchemaNameContext, Value: Age
           Type: OC_SymbolicNameContext, Value: Age
           Type: TerminalNodeImpl, Value: Age
         Type: TerminalNodeImpl, Value:
         Type: OC_PartialComparisonExpressionContext, Value: = 18
           Type: TerminalNodeImpl, Value: =
           Type: TerminalNodeImpl, Value:
           Type: OC_AddOrSubtractExpressionContext, Value: 18
             Type: OC_MultiplyDivideModuloExpressionContext, Value: 18
             Type: OC_PowerOfExpressionContext, Value: 18
             Type: OC_UnaryAddOrSubtractExpressionContext, Value: 18
             Type: OC_StringListNullOperatorExpressionContext, Value: 18
             Type: OC_PropertyOrLabelsExpressionContext, Value: 18
             Type: OC_AtomContext, Value: 18
             Type: OC_LiteralContext, Value: 18
             Type: OC_NumberLiteralContext, Value: 18
             Type: OC_IntegerLiteralContext, Value: 18
             Type: TerminalNodeImpl, Value: 18
Type: TerminalNodeImpl, Value:

Type: OC_ReturnContext, Value: RETURN n
  Type: TerminalNodeImpl, Value: RETURN
  Type: OC_ProjectionBodyContext, Value:  n
   Type: TerminalNodeImpl, Value:
   Type: OC_ProjectionItemsContext, Value: n
    Type: OC_ProjectionItemContext, Value: n
    Type: OC_ExpressionContext, Value: n
    Type: OC_OrExpressionContext, Value: n
    Type: OC_XorExpressionContext, Value: n
    Type: OC_AndExpressionContext, Value: n
    Type: OC_NotExpressionContext, Value: n
    Type: OC_ComparisonExpressionContext, Value: n
    Type: OC_AddOrSubtractExpressionContext, Value: n
    Type: OC_MultiplyDivideModuloExpressionContext, Value: n
    Type: OC_PowerOfExpressionContext, Value: n
    Type: OC_UnaryAddOrSubtractExpressionContext, Value: n
    Type: OC_StringListNullOperatorExpressionContext, Value: n
    Type: OC_PropertyOrLabelsExpressionContext, Value: n
    Type: OC_AtomContext, Value: n
    Type: OC_VariableContext, Value: n
    Type: OC_SymbolicNameContext, Value: n
    Type: TerminalNodeImpl, Value: n
--- End Parse Tree ---

For convenience, I attached a graphical representation of all elements of the parse tree diagram
Parse Tree Diagram Expression

As you can see, current interpretation of a parse tree is incomplete and incorrect. Especially this part seems to defy all logic, how to determine if this is any of these expression? The only way I see this parsed is to write Template-like structure to see at the children of every context and somehow take into account every edge case

Type: OC_ExpressionContext, Value: n.name='Adam' AND n.Age = 18
     Type: OC_OrExpressionContext, Value: n.name='Adam' AND n.Age = 18
     Type: OC_XorExpressionContext, Value: n.name='Adam' AND n.Age = 18
     Type: OC_AndExpressionContext, Value: n.name='Adam' AND n.Age = 18
     Type: OC_NotExpressionContext, Value: n.name='Adam'
     Type: OC_ComparisonExpressionContext, Value: n.name='Adam'
     Type: OC_AddOrSubtractExpressionContext, Value: n.name
     Type: OC_MultiplyDivideModuloExpressionContext, Value: n.name
     Type: OC_PowerOfExpressionContext, Value: n.name
     Type: OC_UnaryAddOrSubtractExpressionContext, Value: n.name
     Type: OC_StringListNullOperatorExpressionContext, Value: n.name
     Type: OC_PropertyOrLabelsExpressionContext, Value: n.name

Here's a sourcecode of ExpressionAnalyzer - It analyzes the expression : (https://github.com/pmikstacki/SliccDB/blob/master/SliccDB.Cypher/Analyzers/ExpressionAnalyzer.cs)

Here's a method that I used to generate text parse tree: (https://github.com/pmikstacki/SliccDB/blob/master/SliccDB.Cypher/Utility/ParseTreeHelper.cs)

Here's a code that I used to generate the Graphical Representation of a parse tree: (https://github.com/pmikstacki/SliccDB/blob/master/SliccDB.Explorer/Utility/DebugParseTreeVisual.xaml.cs)

Here's a working implementation of PatternPart that is working correctly (notice the similar way that I try to parse the oC_PatternElement: (https://github.com/pmikstacki/SliccDB/blob/master/SliccDB.Cypher/Analyzers/PatternPartAnalyzer.cs)

I hope that the details above are sufficient.
Is this supposed to look like this or there is an error in grammar?
Kind Regards,
Piotr Mikstacki

@hvub
Copy link
Contributor

hvub commented Nov 30, 2021

Hi Piotr,

thanks for your detailed reply.

I understand you are confused why a parser gives you for n.name='Adam' AND n.age=18 an OrExpression containing a XorExpression containing an AndExpression containing a NotExpression and so on all the way through the whole stack of expressions – while at the same time the railroad diagrams show them all neatly as alternatives of one single thing.

The answer to this is that two-fold.

Firstly, the railroad diagrams are not the ground truth of the grammar, but merely a more developer-manual-like representation of the grammar that gets generator out of the grammar xml files. The g4 antlr grammar also gets generated from the same xml files. These xml files are the ground truth of the grammar.

For (subjectively) better readability, the railroad diagrams inline certain grammar productions into a single picture. This is again encoded in the grammar xml files. Productions that have the attribute rr:inline="true" get inlined. You will find this attribute on almost all expression productions. This is why they all end up in a single railroad diagram, although they are in fact a large number of individual grammar productions.

Secondly, let's look at the "true" grammar as encoded in the xml files. The production Expression is the top of the stack of productions that define the syntax of expressions, followed by the productions for the logical expressions (OR, XOR, AND, etc). For these you find the following. in the xml files:

  <production name="Expression">
    <non-terminal ref="OrExpression"/>
  </production>

  <production name="OrExpression" rr:inline="true">
    <non-terminal ref="XorExpression"/>
    <repeat>&SP; OR &SP;<non-terminal ref="XorExpression"/></repeat>
  </production>

  <production name="XorExpression" rr:inline="true">
    <non-terminal ref="AndExpression"/>
    <repeat>&SP; XOR &SP;<non-terminal ref="AndExpression"/></repeat>
  </production>

  <production name="AndExpression" rr:inline="true">
    <non-terminal ref="NotExpression"/>
    <repeat>&SP; AND &SP;<non-terminal ref="NotExpression"/></repeat>
  </production>

  <production name="NotExpression" rr:inline="true">
    <repeat>NOT &WS;</repeat>
    <non-terminal ref="ComparisonExpression"/>
  </production>

They all follow the same pattern. Every production allows the syntax to repeat the same operation zero or more time, while the respective operands are expression of operand that is next in order of increasing precedence. This allows to nest operations of higher precedence into operation of lower precedence, so that you can write a XOR b OR c XOR d, where a XOR b and c XOR d are operands of _ OR _.

Note also that each production allows zero repetitions of the operation it encodes, where zero repetition means the operation does not appear at all. The effect is that an expression can "drop through" the operations it does not need. Or more specifically, it allows that an _ OR _ does not require its operand to be _ XOR _s but allows any expression of high precedence, so that you can also write e.g. NOT b OR c AND d, where NOT b and c AND d are operands of _ OR _.

This pattern continues through the whole stack of expressions and ends in the production called Atom. Atoms are expressions with the highest precedence such as a literal or an expression enclosed in parentheses. The latter loops back to the top of the expression stack, which allows nesting operations of lower precedence under operations of high precedence, so that you can write a AND (b OR c), where a and b OR c are the operands of _ AND _, something the production listed above do not allow by themselves.

This recursive pattern of "drop through" production rules is the usual way expression grammar looks like if they encode the precedence. Some grammars have recursion in each production itself (instead of repetition), to encode operator associativity in the grammar. The oC grammar does not do that. Some grammars have multiple such stacks of production with the grammar also encode value type rules (e.g. you can not AND´d integers). The oC grammar does not encode value type rules so it has just a single stack of expression productions including all operations.

When consuming the parsing result, the effect of this grammar design is that you see also expressions in the parse tree. because the parse put a node in the parse tree for each production it uses. This is totally normal and expected.

Many implementations have some kind of normalization and simplification step that turn the parse tree into an intermediate query representation (IR) which is usually also a tree and is feed into the query processor. This IR would usually have all "drop through" productions be removed, because they are a pure parsing artifacts. It is easy to detect, for instance, if an OrExpression only contains a single child (which is an XorExpression) that where is no actual OR in the query text and the IR does not need a node for the OR operations. Note this kind of removal make sense in the expression context but is not a generally correct step. In other contexts, even the parse tree node with only a single child may convey valuable information. However, when designing an IR you are free to chose how to represent such information in your IR. Parse generator frameworks such as Antlr cannot know these things, since they only see the grammar, i.e. a syntax specification and do not know the semantics of the specified language.

Hope that helps.

Best
-H

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants