Recursive Descent Parsing

Purpose

The goal of this work was to learn more about how compilers work and parse input from a programming language to an internal representation and machine code. Within this repository are two implementations for a recursive descent parser: one written in Python and one written in C. Both define a simple programming language based on arithmetic expressions with a scanner separate from the parser.

Background

Language implementation systems must analyze source code, regardless of the specific implementation approach. Nearly all syntax analysis is based on a formal description of the syntax of the source language (Backus-Naur form, or BNF). The syntax analysis portion of a language processor nearly always consists of two parts:

A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

Lexical and syntax analysis are typically separated because of simplicity (less complex approaches can be used for lexical analysis; separating them simplifies the parser), efficiency (separation allows optimization of the lexical analyzer) and portability (parts of the lexical analyzer may not be portable, but the parser is always portable).

A lexixal analyzer is a pattern matcher for character strings (i.e. a front-end for the parser). It identifies substrings of the source program that belong together - lexemes. Lexemes match a character pattern, which is associated with a lexical category called a token. The lexical analyzer is usually a function that is called by the parser when it needs the next token.

The goals of the parser, given an input program, are to find all syntax errors (for each, produce an appropriate diagnostic message and recover quicky) and produce the parse tree, or at least a trace of the parse tree, for the program. There are two categories of parsers:

Top-down: produces the parse tree beginning at the root

Order is that of a leftmost derivation
Traces or builds the parse tree in preorder

Bottom-up: produces the parse tree beginning at the leaves

Order is that of the reverse of a rightmost derivation
Useful parsers look only one token ahead in the input

The work in this repository focuses on top-down parsers, where given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A. The most common top-down parsing algorithms are recursive-descent (a coded implementation, used in this repo) and LL parsers (table-driven implementations).

For recursive-descent parsers, there is a subprogram for each nonterminal in the grammar which can parse sentences that can be generated by that nonterminal. Extended Backus-Naur form (EBNF) is ideally suited for being the basis of a recursive-descent parser, because EBNF minimizes the number of nonterminals. The following shows an example of a grammar for simple expressions:

<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | int_constant | ( <expr> )

When there is only one right-hand side (RHS):

For each terminal symbol in the RHS, compare it with the next input token; if they match, continue, else there is an error
For each nonterminal symbol in the RHS, call its associated parsing subprogram

A nonterminal that has more than one RHS requires an initial process to determine which RHS it is to parse:

The correct RHS is chosen on the basis of the next token of input (the lookahead)
The next token is compared with the first token that can be generated by each RHS until a match is found
If no match is found, it is a syntax error

The Left Recursion Problem: If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser. A grammar can be modified to remove direct left recursion as follows:
For each non-terminal, A,

Group the A-rules as

A → Aα1 | … | Aαm | β1 | β2 | … | βn

where none of the β‘s begins with A 2. Replace the original A-rules with

A → β1A’ | β2A’ | … | βnA’ 
A’ → α1A’ | α2A’ | … | αmA’ | ε

The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness, the inability to determine the correct RHS on the basis of one token of lookahead.

Def: FIRST(α) = {a | α =>* aβ }
(If α =>* ε, ε is in FIRST(α)

Pairwise Disjointness Test:
For each nonterminal, A, in the grammar that has more than one RHS, for each pair of rules, A → αi and A → αj, it must be true that FIRST(αi) ⋂ FIRST(αj) = Φ. For example:
A → a | bB | cAb
A → a | aB
Left factoring can resolve pairwise disjointness! Replace
<variable> → identifier | identifier [<expression>]
with

<variable> → identifier <new>
<new> → ε | [<expression>]

or
<variable> → identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)

Author

Michael Balas

License

GNU General Public License

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
LICENSE		LICENSE
RDP.c		RDP.c
RDP.py		RDP.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LICENSE

LICENSE

RDP.c

RDP.c

RDP.py

RDP.py

README.md

README.md

Repository files navigation

Recursive Descent Parsing

Purpose

Background

Author

License

About

Releases

Packages

Languages

License

MichaelBalas/Recursive_Descent_Parser

Folders and files

Latest commit

History

Repository files navigation

Recursive Descent Parsing

Purpose

Background

Author

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages