zsmoore/lexr

lexr

lexr is a lightweight tokenizer written in JavaScript, meant to be a more modern and cleaner alternative to its C ancestor, flex.

Goals

lexr is self-contained and works on its own; however, it is intended to be used hand in hand with grammr once that project is finished.
lexr and grammr are an effort to rethink how flex and bison traditionally work together and to move development toward a more modern process.

Both projects are being developed for use in the rework of ivi, a project that aims to visualize code for teaching introductory programming.

Features

The lexical analyzer currently has built-in support for JavaScript, with plans to extend to other languages.
If your language is not supported, or you would simply like to use custom tokens, you can do that as well.

Currently supported:

  • Using built-in or custom tokens
  • Adding Tokens either one by one or in a set to the tokenizer
  • Error Detection
  • Functions on Token Recognition
  • Overwriting error token name
  • Removing Tokens from the token set
  • Ignoring tokens for output either one by one or in a set
  • UnIgnoring tokens from the token set
  • Custom Output for tokens
  • Removing custom output for tokens

Built-In Language Support

  • Javascript
  • Json

Usage

The entire library wraps around a Tokenizer class.
First, import the library:

let lexr = require('lexr');

To use a built-in language, initialize the tokenizer like so:

let tokenizer = new lexr.Tokenizer("Javascript");

If you would like to use fully custom tokens, initialize it like so:

let tokenizer = new lexr.Tokenizer("");

If you have selected a built-in language, you cannot add or remove tokens until you disable strict mode.
To do so, call the disableStrict() function on the tokenizer instance.

Tokens

Once strict mode is disabled, or if you are working with a fully custom tokenizer, you can add tokens in two ways:

// Add a single token
// Arguments: tokenName, RegExp pattern
tokenizer.addToken("L_PAREN", /\(/);

// Add multiple tokens
// Must be in the form of a Javascript Object
const tokens = {
  L_BRACE : /{/,
  R_BRACE : new RegExp('}'),
};
tokenizer.addTokenSet(tokens);

You can also remove pre-existing tokens if you are using a custom language or have disabled strict mode.

tokenizer.removeToken("L_PAREN");

White Space and New Lines

The tokenizer can ignore whitespace and newlines; call the corresponding methods on the instance.
Example:

let x = new lexr.Tokenizer("Json");
x.ignoreWhiteSpace();
x.ignoreNewLine();

Functions

To run a function whenever a token is recognized, add functions either as a set with addFunctionSet or one at a time with addFunction.

// Add functions through set
let funs = {
  WHITESPACE  : function() { whitespaceCount += 1 },
  IDENTIFIER  : function() { idNum += 1 }
}
tokenizer.addFunctionSet(funs);

// Add single function
tokenizer.addFunction("NEW_LINE", function() { lineCount += 1 });

// Remove function
tokenizer.removeFunction("IDENTIFIER");

Function Usage

lexr differs slightly from flex in how functions are handled.
Your functions should not use any information from the current token, since that information is available to you after tokenization.
This keeps the functions being executed smaller and cleaner.

Example Function Usage

An example of code that goes with the functions is as follows:

// Declare the counters before tokenizing so the functions can update them
let whitespaceCount = 0;
let idNum = 0;

let funs = {
  WHITESPACE  : function() { whitespaceCount += 1 },
  IDENTIFIER  : function() { idNum += 1 }
}
tokenizer.addFunctionSet(funs);

let input = `var a = 4;
             var b = 3;`;
tokenizer.tokenize(input);

Function Scoping Oddities

Since the functions are stored in an object inside the tokenizer, scoping can get tricky.
The example above works; however, the suggested approach is to create a functions.js in order to:

  • map each token name to its function
  • declare any variables you need outside of the function object
  • export the function object, along with any variables you want to access, to the files that need them
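
A minimal sketch of such a functions.js, assuming CommonJS modules as used elsewhere in this README. The file name, the counters object, and the choice of tokens are illustrative; only the token-name-to-function shape comes from the examples above.

```javascript
// functions.js (hypothetical file name)

// State lives outside the function object so other files can read it.
const counters = { whitespace: 0, lines: 0 };

// Map each token name to the function to run when that token is recognized.
const funs = {
  WHITESPACE: function () { counters.whitespace += 1; },
  NEW_LINE: function () { counters.lines += 1; },
};

// Export both: `funs` is meant to be passed to addFunctionSet,
// and `counters` can be read after tokenizing.
module.exports = { funs, counters };
```

In index.js you would then require this module, pass funs to tokenizer.addFunctionSet, and read counters after calling tokenize.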

yytext Function Alternative

Instead of using yytext within your functions, the suggested approach is to analyze the output after tokenization.
For example, grabbing all identifier names and inserting them into a symbol table would look like:

let input = `var a = 4;
             var b = 3;`;
let output = tokenizer.tokenize(input);

let symbolTable = {};
for (let i = 0; i < output.length; i++) {
  if (output[i].token === "IDENTIFIER") {
    symbolTable[output[i].value] = undefined;
  }    
}

Error Token

By default, the token name used for unrecognized input is ERROR; to change it, call setErrTok like so:

tokenizer.setErrTok("DIFF_ERROR");

Ignore Tokens

You can also keep certain tokens out of the output, either by calling addIgnore:

tokenizer.addIgnore("WHITESPACE");

Or by adding an entire set through an array or an object:

let ignore = ["WHITESPACE", "VAR"];
tokenizer.addIgnoreSet(ignore);

// Or through an object which allows true or false
let ignore2 = {
  "WHITESPACE"  : true,
  "VAR"         : false,
};
tokenizer.addIgnoreSet(ignore2);

To unignore tokens programmatically, call the unIgnore method:

tokenizer.unIgnore("WHITESPACE");

Custom Output

You can make your output more verbose by adding a customOut field to the output object.
As with the other operations, you can add a single token or a set of tokens, and you can remove them again.

// Add a set of custom outputs
let customOut = {
  "WHITESPACE"  : 2,
  "VAR"         : 'declaration',
}
tokenizer.addCustomOutSet(customOut);

// Add a single custom output
tokenizer.addCustomOut("SEMI_COLON", 111);

// Remove a custom out
tokenizer.removeCustomOut("VAR");

A sample output object would then look like:

{ token: 'WHITESPACE', value: ' ', customOut: 2 }

Tokenization

Lastly, to tokenize your input code, call the tokenizer's tokenize method:

let output = tokenizer.tokenize(aString);

Output

The output of tokenize(aString) is currently a list of objects, each with two properties:
token, the name of the token that was captured, and value, the text that matched it.
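
Because the output is a plain list of objects, it can be processed with ordinary array code. The following is a small sketch, assuming only the { token, value } shape described above. The token list is written out by hand so the snippet runs without the library, and the helper names countByToken and joinValues are made up for illustration.

```javascript
// Token list in the shape produced by tokenize(), written by hand here.
const output = [
  { token: 'VAR', value: 'var' },
  { token: 'WHITESPACE', value: ' ' },
  { token: 'IDENTIFIER', value: 'a' },
  { token: 'SEMI_COLON', value: ';' },
];

// Count how many times each token name occurs in the list.
function countByToken(tokens) {
  const counts = {};
  for (const t of tokens) {
    counts[t.token] = (counts[t.token] || 0) + 1;
  }
  return counts;
}

// Join every captured value in order. This reproduces the input text
// only if no tokens were left out of the list.
function joinValues(tokens) {
  return tokens.map((t) => t.value).join('');
}

console.log(countByToken(output));
console.log(joinValues(output)); // "var a;"
```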

Examples

Sample Program

let lexr = require('lexr');
let tokenizer = new lexr.Tokenizer("Javascript");
let input = "var a = null;";
let output = tokenizer.tokenize(input);
console.log(output);

Output would then be:

[ { token: 'VAR', value: 'var' },
  { token: 'WHITESPACE', value: ' ' },
  { token: 'IDENTIFIER', value: 'a' },
  { token: 'WHITESPACE', value: ' ' },
  { token: 'ASSIGN', value: '=' },
  { token: 'WHITESPACE', value: ' ' },
  { token: 'NULL_LIT', value: 'null' },
  { token: 'SEMI_COLON', value: ';' } ]

Sample Program with Token Errors

let lexr = require('lexr');
let tokenizer = new lexr.Tokenizer("");
tokenizer.addToken("PLUS", /\+/);
tokenizer.setErrTok("DIFF_ERROR");
let input = "5+5;";
let output = tokenizer.tokenize(input);
console.log(output);

Output would then be:

[ { token: 'DIFF_ERROR', value: '5' },
  { token: 'PLUS', value: '+' },
  { token: 'DIFF_ERROR', value: '5;' } ]
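
Since error tokens appear in the output like any other token, they can be collected after tokenization. A minimal sketch, using the output above copied in by hand so it runs without the library; findErrors is a hypothetical helper, not part of lexr.

```javascript
// Output of the sample program above, copied in by hand.
const output = [
  { token: 'DIFF_ERROR', value: '5' },
  { token: 'PLUS', value: '+' },
  { token: 'DIFF_ERROR', value: '5;' },
];

// Return the matched text of every token tagged with the given error name.
// The default mirrors the default error token name, ERROR.
function findErrors(tokens, errorName = 'ERROR') {
  return tokens.filter((t) => t.token === errorName).map((t) => t.value);
}

const errors = findErrors(output, 'DIFF_ERROR');
if (errors.length > 0) {
  console.error(`Unrecognized input: ${errors.join(', ')}`);
}
```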

Suggested Workflow

If you are not using a built-in language, I suggest separating each part of the tokenization into its own file.
If you are working with a complex language where the regexes can become very large, build those regexes up in a separate file and export only the final regex to your token object.

Since the Tokenizer accepts sets of information, it is easiest to separate everything and use exports between files.

Sample project structure

+-- src/
    +-- index.js
    +-- functions/
    |   +-- functions.js
    +-- tokens/
    |   +-- tokens.js
    |   +-- regexPatterns.js
    +-- ignore/
    |   +-- ignoreTokens.js
    +-- customOut/
        +-- customOutput.js
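
The split between regexPatterns.js and tokens.js might look like the sketch below. Both files are shown in one snippet so it runs as is. The pattern names (digit, exponent, NUMBER) and the number grammar are assumptions chosen for illustration, and the sketch does not cover whether lexr expects patterns to be anchored.

```javascript
// regexPatterns.js (hypothetical): build a large pattern from small pieces.
const digit = /[0-9]/;
const exponent = /[eE][+-]?[0-9]+/;

// Combine the pieces through their .source strings:
// digits, an optional fractional part, an optional exponent.
const NUMBER = new RegExp(
  `${digit.source}+(\\.${digit.source}+)?(${exponent.source})?`
);

// tokens.js (hypothetical): only the finished regexes go into the token object,
// in the form accepted by addTokenSet.
const tokens = {
  NUMBER,
  PLUS: /\+/,
};

console.log(tokens.NUMBER.source);
```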

Future Features

If there are any good features missing, feel free to open an issue with a feature request.