Skip to content

azatoth/PanPG

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PanPG API

PanPG v0.0.9. This documents the most recent stable version. The current bleeding edge documentation is also available and may include differences in the development code since the last release.

The about page always has links to recent versions.

Introduction

PanPG is a parser generator which reads a PEG description of a grammar and writes a packrat parser in pure JavaScript.

There is an API for compiling a parser from a grammar, and a smaller API of helpful convenience functions for dealing with the parse trees that a parser produces.

The PEG Language

Grammars are given as a parser expression grammar (PEG).

The arithmetic grammar from the demo makes a good example of the basic features.

Expr ← Add
 
Add ← Mult ( S? "+" S? Mult )*

Mult ← Num ( S? "*" S? Num )*

Num ← "0" / [1-9] [0-9]*

S ← [ U+0020 ]+

PEGs are similar to context-free grammars (CFGs), but have a more imperative nature: while a CFG describes a language, which may or may not be easy to parse, and may be ambiguous, a PEG describes a deterministic parsing strategy. The unambiguous nature of PEGs makes them convenient for parsing programming languages. Wikipedia's article on PEGs is a good introduction. PEGs do not require a separate lexer; the same formalism is used to define the grammar down to the level of individual characters.

Due to the difference in emphasis between CFGs and PEGs, what in a CFG would be called a production is here called a parse rule.

TODO: document accepted PEG language

Until this is properly documented, the self-describing PEG grammar is the definitive reference. The ECMAScript grammar is a practical example of a full programming language grammar.

Compiling

The following API is used to produce a parser from a PEG. To use it, include the file PanPG.js. The library is only required to generate a parser; once generated, the parsers do not have any dependencies. In most cases, rather than compiling the parser at the point of use, which is slow and requires the parser generator, the parser will instead be compiled once and then included where it will be used. The parser generator API is demonstrated in the arithmetic grammar demo.

PanPG.generateParser(peg,opts)

  • peg is a string, containing the grammar (or an array of grammars to be merged)
  • opts, documented below, optional, controls various details of code generation.

When successful, the return value is a string containing a JavaScript program which declares a single function which is the parser. This string can be passed to eval() or written to a file for later use. In the case of failure, various exceptions will be thrown. Common reasons for failure are syntax errors in the provided PEG, and rules that are depended on but not defined.

If an array is passed as the first argument, each element is parsed as a PEG and the rules are combined, with those occurring in later grammars taking precedence. This is equivalent to using the patches option, described below.

opts

The opts argument is optional and if provided, all the properties are optional. The defined properties and default behaviors are as follows, with the most common and most useful listed first:

  • drop

    An array of parse rule names to drop from the generated parse tree. Any node generated by such a parse, and all its children, will not appear in the parse tree. This does not change what inputs will be accepted by the parser, only the parse tree that is returned. This is commonly used when parsing programming languages to exclude low-level or atomic rules such as whitespace, comments, individual tokens, or any other rule which is not interesting. As an extreme case, if the start rule is dropped, then no parse tree will be returned, but the parser will still recognize the grammar and indicate failure (by returning a parse error) or success (by not returning a parse error). By default, all rules are returned in parse tree results, which is often convenient when designing and testing a grammar.

  • elide

    An array of parse rule names to elide in the generated parse tree. The node generated by any rule appearing in opts.elide will not appear in the generated tree; instead it will be replaced by its children. For example, in the grammar S ← A; A ← B; B ← "x", the expected parse tree would have an S node with an A child node with a B child node. If {elide:['A']} was passed, the generated tree would contain the S node with a B child node, the A node having been elided. If {drop:['A']} was passed, the generated tree would contain only the S node alone. By default, no nodes are elided.

  • fname

    The parser is returned as a JavaScript program which declares a single function. This option sets the name of that function. If the commonjs option is set, this sets the name of the module returned. By default, the start token of the grammar is used as the function name.

  • start

    The start rule may be given as an argument. This can be used to create a parser for a subset of a grammar, which can be useful when testing or when reusing a part of a grammar. Only the named rule given, and its dependencies, will be included in the generated parser. For example, if you had a parser for C and wanted to generate a parser for just C expressions (but not statements or programs) you could do something like generateParser(complete\_C\_grammar, {start:'Expression'}. This can also be combined with opts.patches to create hybrid grammars. It is also possible to store different grammars (such as different versions of the same or closely related languages) in the same file with different start rules, and common rules shared; the desired parser can then be generated by passing the appropriate opts.start rule name. By default, the first parse rule in the PEG grammar is taken to be the start rule.

  • commonjs

    Instead of a bare function, the returned parser will be a CommonJS-compatible module. The name of the module will be taken from the fname option or the start rule if that is unset. The module exports a parse function. The module will be browser-compatible, so it can be used in the browser as well as in a CommonJS environment. This means that a hypothetical parser for language Foo could be used as Foo.parse(...) either in the browser, by including <script src="foo.js"></script>, or in a CommonJS environment by using var Foo=require('foo'). By default, CommonJS output is not used and the returned program declares a single function at the top level.

  • patches

    A set of patches may be provided as an array of PEGs given as strings, just like the main grammar. First each is parsed, then each is applied, in the order given, to the main grammar. To apply a patch here means that each rule appearing in it is added to the main grammar, replacing the existing rule with the same name if any. Since patches can override rules as well as add them, it is possible to replace or modify as much of the grammar as desired. Alternatively, multiple grammars can be provided as an array in the first argument to generateParser.

The following three options are all off by default and are primarily used for debugging of grammars or of the parser generator itself.

  • trace

    If true, the generated parser will generate trace messages, which can be used for extremely detailed debugging and analysis of generated parsers. The trace output will typically be orders of magnitude larger than the parse tree, so this option should be turned on with care; tracing parsers work best on small inputs. When combined with a subset parser using start, this can be a powerful debugging tool. Tracing is used (automatically) by the explain function documented below.

  • debug

    This causes the parser generator to dump large amounts of internal information as a source code comment before the parser itself. This data will be mostly of interest to those who are working on PanPG itself. Apart from the added comment, the generated parser code is unchanged.

  • asserts

    If true, the generated parser will include assertion statements, and will be slower. By design, the assertions should never fail, regardless of the grammar and input, so this is not useful for debugging grammars, only the parser generator itself or for testing JavaScript engine sanity.

PanPG.explain(grammar,opts,input,[verbosity])

When a particular grammar and input do not give the parse tree (or error!) that was expected, explain can be used to determine what the parser did in excruciating detail. The arguments are the original PEG grammar, the options as they would be used with generateParser, and the sample input. The output is a human readable string describing what the parser did at every step, in several sections. ASCII-art is used, so the string should be displayed in a monospace font.

The grammar is required because this function recompiles the parser with tracing enabled, and then runs it over the input and analyzes the trace so generated. The compiled grammar is cached by the explain function itself, so if called again with only the input text changed, it will not generate the parser code again. Nonetheless, this function can be quite slow and generates truly voluminous output, so care is advised and small inputs are recommended. As an example, with the JSON grammar and a sample input of 706 bytes, explain generates over 6 MB of output including a complete trace of the full parser state at every change, and an analysis of the trace which shows the order and nesting of each rule that was tested at each position in the input. The start option can be used to specify a particular parse rule to isolate issues to the relevant part of the grammar.

The optional verbosity argument increases the verbosity. The default verbosity is 1. If verbosity is 2, the trace will be shown. If verbosity is 3, the generated parser code and explanatory source code comments will also be included. If verbosity is 4, the raw parse tree (an array of numbers) will also be shown.

Parsing and Parse Trees

Once a parser is generated using the API above, it can be used to parse input, returning a parse tree.

This is used in the arithmetic parsing demo. An example with a more complex grammar is provided by the ECMAScript 5 demo.

The generated parsers consist of a single function, or if the CommonJS output is used, a module which exports a single function. The function takes a string to be parsed and returns either a parse tree or an error.

The return value is an array, either of the form [true, <parse tree>], or [false, <pos>, <error>, <input>]. The format of these arrays may change in future versions, but the first element will always be a Boolean indicating success. The format of the parse tree is described below. In the case of a parse error, the <pos> is the position in the input at which parsing failed.

Once a parse result is in hand, it may be passed to showTree or treeWalker. These functions are contained in PanPG_util.js. The showTree function takes the output of a parser, which may be either an error or a parse tree, and displays errors in human-readable form and parse trees as ASCII art. treeWalker takes a parse result and a set of callback functions and walks the parse tree.

Parse Tree Format

Rather than embedded code which is executed on parser rules, the generated parser returns a parse tree which can be processed further as desired. Most users will not deal with the parse trees directly but will prefer the treeWalker function, documented below, or some other API.

The parse tree format is designed to use little memory and to allow streaming. Streaming parsing is not yet automatically supported but is in the works, see the project roadmap.

For complete details on the parse tree format see its documentation.

PanPG_util.showTree(result, opts)

  • result is the return value of a parser.

  • opts is an object which may have the following properties:

    • elide is an array of rule names to be hidden in the output.

    • drop is an array of rule names to be completely suppressed including descendant nodes.

showTree takes a parse result and returns a pretty-printed ASCII representation of the parse tree.

In the event that the parse failed, showTree calls showError which returns a human-readable description of the error including the line on which the failure occurred. Note that the error will be reported at the position of the last alternative to be tried, which may be earlier in the input than the actual error, depending on the grammar.

The following example shows a short JavaScript program, and the parse tree that the ECMAScript 5 parser generates when parsing this program. This tree contains parse nodes that are not interesting for most uses, so passing {elide:['anonymous','IdentifierStart','UnicodeLetter',...]} to showTree would give a more concise output. Once a grammar is complete and correct, rather than just hiding the uninteresting nodes in showTree, the uninteresting rule names (for a particular use of the grammar) can be passed to the parser generator to produce a parser that simply does not generate those nodes.

function f(){return 42}

Program 0-25 "function f(x){re…"
 FunctionDeclaration 0-25 "function f(x){re…"
  FunctionTok 0-8 "function"
  S 8-9 " "
   WhiteSpace 8-9 " "
  Identifier 9-10 "f"
   IdentifierName 9-10 "f"
    IdentifierStart 9-10 "f"
     UnicodeLetter 9-10 "f"
  anonymous 10-11 "("
  FormalParameterList 11-12 "x"
   Identifier 11-12 "x"
    IdentifierName 11-12 "x"
     IdentifierStart 11-12 "x"
      UnicodeLetter 11-12 "x"
  anonymous 12-13 ")"
  anonymous 13-14 "{"
  FunctionBody 14-24 "return x*x"
   Statement 14-24 "return x*x"
    ReturnStatement 14-24 "return x*x"
     ReturnTok 14-20 "return"
     SnoLB 20-21 " "
      WhiteSpace 20-21 " "
     Expr 21-24 "x*x"
      AssignmentExpr 21-24 "x*x"
       ConditionalExpr 21-24 "x*x"
        LogicalOrExpr 21-24 "x*x"
         LogicalAndExpr 21-24 "x*x"
          BitwiseOrExpr 21-24 "x*x"
           BitwiseXOrExpr 21-24 "x*x"
            BitwiseAndExpr 21-24 "x*x"
             EqualityExpr 21-24 "x*x"
              RelationalExpr 21-24 "x*x"
               ShiftExpr 21-24 "x*x"
                AdditiveExpr 21-24 "x*x"
                 MultiplicativeExpr 21-24 "x*x"
                  UnaryExpr 21-22 "x"
                   PostfixExpr 21-22 "x"
                    LeftHandSideExpr 21-22 "x"
                     NewExpr 21-22 "x"
                      MemberExpr 21-22 "x"
                       PrimaryExpr 21-22 "x"
                        Identifier 21-22 "x"
                         IdentifierName 21-22 "x"
                          IdentifierStart 21-22 "x"
                           UnicodeLetter 21-22 "x"
                  MultiplicativeOp 22-23 "*"
                  UnaryExpr 23-24 "x"
                   PostfixExpr 23-24 "x"
                    LeftHandSideExpr 23-24 "x"
                     NewExpr 23-24 "x"
                      MemberExpr 23-24 "x"
                       PrimaryExpr 23-24 "x"
                        Identifier 23-24 "x"
                         IdentifierName 23-24 "x"
                          IdentifierStart 23-24 "x"
                           UnicodeLetter 23-24 "x"
     EOS 24-24 ""
  anonymous 24-25 "}"

PanPG_util.treeWalker(callbacks, result)

treeWalker takes a set of callback functions and a parse result, and walks the result parse tree, calling the callbacks provided for each node in the parse tree.

Callbacks are named according to the corresponding parse rule in the grammar.

A callback is called by the walker when a node ends (post-order traversal), so the top-level rule or start rule of the grammar (corresponding to the parse of the entire input) will be the last callback to be called.

Each callback is called with two arguments: a match object and a child array. The match object provides information about the segment of the input matched by the node. The child array is not an array of parse nodes, but an array of return values from callback functions of the child nodes. The match object has a text() method which returns the matched segment of the input string. (It is provided as a function rather than a string because slicing the match out of the string is a relatively expensive operation, so generating it only when required speeds up execution in some engines considerably.) The match object also has start and end properties which are indices into the input string, useful for such things as syntax highlighters where the position of the parse node within the input string is what is relevant, rather than the contents that were matched.

Five special callbacks can be provided; these are anonymous, other, fail, exception, and warn. All are optional.

anonymous is called on anonymous nodes in the parse tree. These can be seen by passing {hide:[]} options to showTree.

other will be called on named nodes for which no specific callback was provided. In addition to the match and child array arguments described above for named callbacks, it is also called with the name of the rule as a third argument.

fail will be called on parse failure, with the unsuccessful parse result as the only argument.

exception will be called in case an exception is thrown by any callback. Otherwise the exception will propagate normally and will terminate the walk. If the exception callback returns true, the walk will continue, otherwise the treeWalker call will return immediately.

warn will be called when potential errors are detected, such as a callback returning a value when there is no callback for the parent node to receive that value. It is recommended to pass a logging or alert function as the warn callback especially when debugging.

When a callback returns a value, it is appended to an array, and that array is provided to the callback for the parent node. If none of a node's children's callbacks returned a value, then the second argument will be an empty array.

If a callback for the top-level node returns a value, it will be used as the return value of the treeWalker call itself.

There is an example use of the tree walker API in the arithmetic demo.

Source Tree Layout

doc/ contains plain text documentation which may or may not be useful or current. The most current documentation is the source code itself, and the files under doc/ that are pointed to in source code comments.

build/ contains generated files and generated code. This includes the library files themselves, PanPG_generator.js and PanPG_util.js, which are concatenations of other source files and generated files, the generated versions of some of the example grammars that ship with the library (JSON, ES5, and the PanPG PEG grammar itself), built versions of dependencies such as the cset library and its generated Unicode character class tables. Some files, notably parsePEG.js and cset_prod.js, are both built by the project, and depended on by it, so the build/ directory cannot be wiped out and reconstructed easily from the source tree. Currently the build system is browser-based and somewhat arcane, so to do your own build from source you'll probably need to write a non-browser-based build script, perhaps starting from build.js and build_support.js.

grammars/ contains example grammars for JSON, ECMAScript, and the PEG language accepted by PanPG itself.

src/ contains the source files. To read the code start from the API entry points in the API_* files and follow the dependencies from there.

About

PanPG is a parser generator which reads a PEG description of a grammar and writes a packrat parser in pure JavaScript.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published