LRX parser generator for C++

A.M.D.G.

Home Downloads Customers Feedback Theory Documentation Contact
Acknowledgments Installation LRX DFA Definitions

LRX 16.0 User's Manual

Section 1.  Introduction

LRX's main advantage is LR(k) parsing. I call it LR(*) parsing because it is actually nondeterministic LR(k) parsing. It does not usually impact the parsing speed because it only does LR(*) parsing in conflict states (i.e. states with shift-reduce or reduce-reduce conflicts). Usually there are not many conflict states.

Section 2.  Program Options

If you type the program name, "LRX".  Here is what you get:

LRX 16.0.001 32b Copyright Paul B Mann.
|
|   LR(*) PARSER GENERATOR
|
|   lrx <grammar> [/<option>+]
|
|   OPTION  DEFAULT  DESCRIPTION
|   c           0    Conflict report (all conflicts)
|   ca          0    Conflict analysis (trace backs)
|   crr         0    Conflict report for Reduce-Reduce conflicts
|   csr         0    Conflict report for Shift-Reduce conflicts
|   d           0    Debug option for the generated parser
|   g           0    Grammar listing (with numbers)
|   gh          0    Grammar listing (HTML file)
|   k           1    Number of lookaheads for LR(*) parsing
|   m           0    Minimize parser-table size
|   o           0    Optimize parser for speed
|   st          0    State-machine listing
|   v           0    Verbose mode (v=1,2,3)
|   w           0    Print warnings on screen
|_

'c' option: Show all conflicts.

This option tells LRX to show shift-reduce and reduce-reduce conflicts.  .

'ca' option: Show a conflict analysis for each conflict.

This option tells LRX to show a tract back type of explanation for the conflicts.  .

'crr' option: List reduce-reduce conflicts.

This option tells LRX to list conflicts for reduce-reduce conflicts.  .

'csr' option: List shift-reduce conflicts.

This option tells LRX to list conflicts for shift-reduce conflicts.  .

'd' option: Debug option turned on in the parser.

This option tells LRX to turn on the DEBUG_PARSER defined constant for use by "parser.cpp".

'g' option: Output a grammar listing with rule numbers.

This option tells LRX to output a file with a complete grammar listing.  The output file is named ?.grm.grammar.txt..

'gh' option: Output a grammar listing in HTML format.

The output file is named ?.grm.grammar.html..

'k' option: Number of lookaheads (e.g. /k=5).

This option tells LRX how many lookahead symbols to examine to make a parsing decision if it needs to enter LR(*) parsing mode at runtime. A k greater than 1, activates LR(*) parsing..

'm' option: Minimize parser-table size (e.g. /m).

This option tells LRX to experiment with 4 different orientations to find the smallest size for the parser.

'o' option: Optimize parser speed (e.g. /o).

This option tells LRX to do chain-reduction elimination in the state machine and output this optimized parser-tables data structure. You can expect to gain a speed increase of 15% in the parser/lexer process (i.e. syntax-checking only).

'st' option: Output a state-machine listing. 

This option tells LRX to list the state machine with conflicts in the ?.grm.states.txt file. 

'v' option: Verbose mode screen display.

This option tells LRX to display more information on the screen and in the ?.grm.log.txt file. V=2 is basic information. V=2 is advanced information. V=3 is expert level information. With v=3, you get useful information about how much memory was used by LRX. You may need to increase some of the limits if your grammar in incredibly huge. LRX will tell you if it reaches a limit. 

'w' option: Warning messages.

This option tells LRX to display the warning messages on the screen.  The default is 'w=0'


Section 3.  Some Examples.

LRX is a command-line program.  Here are some examples of usage:

   LRX ansic
   LRX ansic /v=3 /c /st /d
   LRX ansic.grm /v=2 /o /m /k=25

Note: command-line arguments start with a "/" or "-" or "!".  The "/" indicates: turn on this option.  The "-" indicates turn on this option. The "!" means turn OFF this option (i.e.  "!ast" means turn off the AST creation). 

If you want to specify defined constants (for use in your grammar or source code), you should put them in the .grm file instead of the .lgr file. Defined constants are NOT necessary and NOT recommended. Not everyone knows what LCURLY means, so it's better to put '{' in your grammar. If you put IDENTIFIER in your grammar instead of <identifier> then you cannot have IDENTIFIER as a keyword of your language. Some languages have uppercase keywords, so <identifier> is prefered.

But if you really want defined constants, you should put them at the top of your .grm file as follows:

/* calc parser grammar. */

   IDENTIFIER   <identifier>     =>     lookup();
   INTEGER      <integer>        =>     lookup();
   EOF          <eof>            ;

   PROGRAM      'program';
   IF           'if';                 
   ENDIF        'endif';
   THEN         'then';
   ELSE         'else';

   EQ           '==';
   NE           '!=';
   PLUS         '+';
   MINUS        '-';
   MUL          '*';
   DIV          '/';
   LCB          '{';
   RCB          '}';
   ASSIGN       '=';
   SEMICOLON    ';';
   LPAR         '(';
   RPAR         ')';

The action operator (=>) means call this action at parsing time. Parse actions are called as rules are recognized in a bottom-up order.  This is good for simple parsing applications, however, a much more powerful action operator is available with LRX. 

The action operator (*>) means build an AST during parsing and after parsing, call this action while traversing the AST.  Consequently, AST actions are called in a top-down order.  This makes life much easier for more complex parsing applications, such as compiler-front-ends.  Here is the more powerful grammar:

Goal     -> Program+  <eof>                   *> goal_
	 			
Program  -> 'program' PgmName '{' Stmt+  '}'  *> program_(2)
	 			
PgmName  -> <identifier>		
	 			
Stmt     -> Target '=' Exp ';'                 *> store_
         -> 'if' RelExp Then 'endif'           *> if_
         -> 'if' RelExp Else 'endif'           *> if_
         -> 'if' RelExp Then2 Else2 'endif'    *> if_
	 			
Target   -> <identifier>                       *> target_(1)~
	 			
RelExp   -> Exp '==' Exp                       *> eq_
         -> Exp '!=' Exp                       *> ne_
	 			
Exp      -> Exp '+' Exp                        *> add_
         -> Exp '-' Exp	                       *> sub_
         -> <integer>                          *> int_(1)
         -> <identifier>                       *> ident_(1)
         -> '(' Exp ')'		
	 			
Then     -> 'then' Stmt+                      *> then_
Else     -> 'else' Stmt+                      *> else_
Then2    -> 'then' Stmt+                      *> then2_
Else2    -> 'else' Stmt+                      *> else2_

Section 4.  Output Files.

LRX creates six output files when analyzing a grammar.  For example, if your grammar is calc.grm, then you will get these output files:

calc.grm.conflicts.txt - A conflicts file, showing the conflicts in your grammar. 
calc.grm.grammar.txt - A grammar file, showing your grammar and the rule numbers. 
calc.grm.grammar.html - A grammar file, HTML format. 
calc.grm.log.txt - A log file, showing a log of what happened. 
calc.grm.states.txt - A state-machine file, showing the parser state machine.
calc.grm.warnings.txt - A warnings file, showing current warning message for your grammar.
calc.lex - A list of tokens and their token numbers, to be read by DFA. 


Section 5.  Grammar Symbols

Grammar Symbols

Grammar symbols are the "words" of your grammar which define the syntax of your computer language.  Here is a list of the different types of symbols allowed in a grammar:

Alpha Symbol        Goal, IfStatement, while, switch, case_statement, target_

Lexical Symbol      <identifier>, <number>, <eof>, <string>

Semantic Symbol     {typedef}, {program_name}, {function_name}

Integer             1, 256, 65535, 32767, 1000000

Literal Symbol      'program', 'if', 'else', 'while', '\'true\'', '\"true\"'

Reserved Grammar Symbols

Only three grammar symbols are reserved and have a special meaning:

<eof>

The end-of-file terminal symbol (i.e.  <eof>).  For example:

Goal   -> Statements <eof>

<error>

Used as a terminal, for an input symbol that causes a syntax error.  This can be used to pass over symbols that you do not care about. 

<error>, [<error>]+

<keyword>

Used as a terminal, to represent all keywords of your language (i.e.  all keywords that are not reserved).  A reserved keyword is specified as: FORMAT^ at the top of your grammar. 

<keyword>, [<identifier>|<keyword>]/','+

Grammar Operators and Punctuators

Grammar operator and punctuators are used to further define the syntax of your computer language.  Here is a list of the different types of operators and punctuators allowed in a grammar:

Rule Notation

'->'     "is a" (starts the right side of a rule).
'.>'     "alternate choice with priority" (useful for: "if expr then stmt else stmt")

':'      "is a" (starts the right side of a rule, same as '->').
'|'      "alternate choice" for another rule.
'.|'     "alternate choice with priority" (useful for: "if expr then stmt else stmt")

';'      "end of a group of rules for a nonterminal" (';' is optional).

Action Operators

Action operators are placed at the end of a rule.  Only one action operator may be placed at the end of a rule.

'=>'     "parse action" (call this function at parsing time).
'+>'     "make a node" (make a node in the AST).
'*>'     "make a node" and "call a node action" (during AST traverrsal).

'=+>'    "parse action" and "make a node".
'=*>'    "parse action", "make a node" and "call a node action".

EBNF Operators

'+'      "one or more occurrences".
'*'      "zero or more occurrences".
'?'      "optional occurrence".
		   
'('      "group start".
')'      "group end".
		   
'['      "optional start".
']'      "optional end".
		   
'|'      "or", inside of a group.
<identifier>/','+    "one or more <identifier>s, separated by commas"

Operator Precedence Notation

'{'      "operator precedence start".
'}'      "operator precedence end", e.g.  { '+' '-' }.

'<<'     "left associative",  used after a list of operators.
'>>'     "right associative", used after a list of operators.

Reserved Word Designator

'^'      "reserved word", placed after a keyword being declared.

Note, the reserved-word indicator '^' should be placed after the keyword when being declared at the top of your grammar.  This is only useful when you are using the reserved symbol, <keyword>, in your grammar.  If you use <keyword> in your grammar, LRX will create a nonterminal for you, <keyword>, that is defined to be a complete list of all the keywords found in your grammar EXCEPT for those keywords you have designated to be reserved with the '^' operator.  Most computer languages will not need this feature.

Reverse the order of the child nodes

'~'      "reverse the node's child nodes".

For example:
Stmt     -> Target '=' Exp ';'   *> store_~		

See the Calc project. The '~' character is placed just after the node name, "store_". In this case, the order of the child nodes will be reversed, which is what you want when doing code generation.

Here are some useful combinations of EBNF notation:

(A|B|C)        A or B or C.
(A|B|C)+       one or more occurrence of A or B or C.  
(A|B|C)*       zero or more occurrence of A or B or C.  
[A|B|C]+       zero or more occurrences of A or B or C.

Color/','+     a comma separated list of Colors.  
(X|Y|Z)/';'+   a list of X or Y or Z separated by semi-colons.  
[X|Y|Z]/';'+   an optional list separated by semi-colons.

Here is a sample of EBNF for some SQL rules:

SelectClause -> SELECT [ALL|DIST] SelectList FromClause 
                   [Where|Groupby|Having|Intersect|Orderby]+

Where        -> WHERE SearchCond 
Groupby      -> GROUP BY ColumnList 
ColumnList   -> ColumnSpec /','+  
Having       -> HAVING SearchCond 
Intersect    -> INTERSECT StmtName
             -> MINUS	StmtName
             -> UNION	StmtName
             -> UNION ALL StmtName

Orderby      -> ORDER BY ColumnSpecList
             -> FOR UPDATE OF ColList

Comments

Comments are allowed in the same style as C++.  Line comments start with '//' and end with the end-of-line character, for example:

// C++ Grammar.
// by John Smith.
// in August 2007.

Block comments start with '/*' and end with '*/'.  These cannot be nested (one inside of another).  Here is an example:

/* Removed 10-15-04 NKP
CreateStmt -> CREATE TableName ColStuffList ';'
-> CREATE TableName UniqueStuff ';' // Added 09-15-04 JKL
*/

Nonterminal Symbols

Nonterminal symbols are the those symbols which are defined to be a sequence of other symbols.  Nonterminals may be composed of letters, numbers and underscores ('_').  The first character cannot be a number.  Here is a list of valid nonterminal symbols:

Start, IfStatement, For_loop, Expression1, Expression_2

Terminal Symbols

Terminal symbols are those symbols of the grammar that are pre-defined, such as keywords, identifiers, numbers, operators, punctuators, character strings, etc.  Most of these symbols are defined in the lexical grammar and recognized by the lexer.  It is this separation of the parser and lexer that allows one to define a grammar without lots of ambiguity.  The concept of a separate lexer also presents a coherent set of primitive symbols to someone learning the language.

<identifier>, '=', for, while, '&&', <eof>, <string>, {typedef}, {function_name}.

Keywords

Keywords are composed of letters, numbers and underscores ('_').  The first character must be a letter.  Keywords may be specified as literal symbols using single quotes (e.g.  'while'), but this is not necessary unless they contain some special character (e.g.  '<html>').  Here is a sample showing some keywords:

Type        -> char | int | short | unsigned 
MetaStart   -> '<META'
MetaItem    -> 'HTTP-EQUIV' '=' Value
            ->	NAME	'=' Value
            ->	CONTENT	'=' Value

Literal Symbols

Literal symbols are symbols that begin and end with a single quote (').  Any character can be specified within single quotes, except end-of-line, tab, space, single quote ('), double quote (") and backslash (\).  To have these unallowed characters in a literal, they must be preceded by a backslash (\), for example: '\'', '\"', '\\', etc.  Here is an example:

PostfixExp  -> PrimaryExp
            -> PostfixExp '[' Exp ']'
            -> PostfixExp '(' [Exp] ')'
            -> PostfixExp '->' Name

Lexical Symbols

Lexical symbols are those symbols of the grammar such as <identifier>, <number>, <string>, etc.  These have no literal form.  For example, all words made up of letters, such as WinMain, CloseFile, MsgBox, etc.  are usually represented in a grammar by <identifier>.  These are tokens recognized by the lexer.  Lexical symbols begin with '<' and end with '>'.  Inside the angle brackets, you may use letters, numbers and underscores.  Here are some examples:

Parameter   -> <identifier>
            -> <integer>
            -> <hexdigits>
            -> <string>

Semantic Symbols

Semantic symbols begin with '{' and end with '}', (e.g.  {typedef}).  They are variable symbols of the grammar that are defined by the lexer to be one terminal (e.g.  <identifier>) and need to be changed later into another terminal (e.g.  {typedef}).  For example, a function name, "main", is first recognized to be an <identifier>, but later it may need to become a {function_name}.  Another classical case is an <identifier> that is defined to be a 'typedef' in the C language.  Here is a grammar segment that solves the 'typedef' problem:

<identifier> => lookup ();

Declaration  ->         VarDecl1 /','+  ';'
             -> typedef VarDecl2 /','+  ';'

VarDecl1     -> Type+  <identifier>
VarDecl2     -> Type+  <identifier>   => typedef_(2, {typedef})

Type         -> SimpleType+
             -> Type '*'

SimpleType   -> char
             -> int
             -> short
             -> unsigned
             -> {typedef}

Here is a difficult input statement that will be parsed correctly without any errors:

typedef unsigned int uint, uint* uintptr;

In this line, when 'uint' is encountered it will be recognized as an <identifier> by the lexer.  At the comma (,) 'uint' will be changed to a {typedef} by the parse action, 'defterm'.  At the second encounter with 'uint', it will be recognized as a {typedef} by the symbol-table lookup function.  Therefore the parsing is handled correctly (with no ambiguity in the grammar).


Section 6.  Parser-Grammar Sections

Parser grammars (.grm files) have 3 sections:

  • 1.  Token Declarations (optional).
  • 2.  Operator Precedence (optional).
  • 3.  Syntax Rules.

1.  Token Declarations (optional)

Token declarations are specified at the top of a grammar.  Any processing that needs to be done is specified here.  '=>' indicates "call this function".  For example, all <identifier>s need to be looked up in the symbol table.  The symbol-table index number is put on the parse stack and later transferred to a node in the AST.  Here is an example:

// Tokens being stored in the symbol table

   <identifier>      =>	lookup();
   <integer>         =>	lookup();
   <string>          =>	lookup();

If you want to assign defined constants to the tokens, you would do it like this:

// Tokens being stored in the symbol table

   IDENTIFIER    <identifier>      =>	lookup();
   INTEGER       <integer>         =>	lookup();
   STRING        <string>          =>	lookup();

In the above examples, <identifier>, <integer> and <string> are input symbols that we want to be put into the symbol table by calling the function, lookup().  We only have to list those tokens which need to be processed by a token action.

2.  Operator Precedence (optional)

The Operator Precedence section allows one to define which operators have priority over others.  For example, when we are parsing expressions, such as 1+2*3, we want the 2*3 to be calculated first.  We specify this operator precedence as follows:

{ '+' '-' } <<  // lower priority
{ '*' '/' } <<  // higher priority

The operators with higher priority are listed below ones with lower priority.  The '<<' means left associative and '>>' means right associative.  A case of right associativity would be the exponential operator '^', for example 2^2^3 would be calculated as 2^8, instead of 4^3.  We want the 2^3 to be computed first.  This operator usually has higher priority than the '*' and '/' operators.  The operator precedence would look like this:

{ '+' '-' } << // left associative
{ '*' '/' } << // left associative
{ '^'     } >> // right associative

3.  Syntax Rules

A rule is a nonterminal symbol followed by a '->' and zero or more symbols.  Any null production (empty rule) should be listed first for readability, but this is not strictly enforced.  Rules have this format:

HeadSymbol ('->' (Nonterminal | Terminal)+ )+  

Goal Symbol or Start Symbol

The first rule of a grammar must define the goal symbol.  The right side of the goal production contains one symbol or one expression and the end-of-file symbol (e.g.  <eof>).  The end-of-file symbol cannot be used anywhere else in the syntax-rules section of the grammar.  Here are some examples of start definitions:

Start -> Input <eof>
Goal  -> CompilationUnit+  <eof> 

Nonterminals

A HeadSymbol (nonterminal symbol) definition may contain many rules.  Here is an example taken from a C grammar.

Stmt -> ';'
     -> AssignmentExp ';'
     -> goto Identifier ';'
     -> return [Exp] ';'
     -> if '(' Exp ')' Stmt
     -> if '(' Exp ')' Stmt else Stmt
     -> switch '(' Exp ')' '{' Cases Default '}'
     -> while '(' Exp ')' Stmt
     -> do Stmt while '(' Exp ')' ';'
     -> for '(' Exp1 ';' Exp2 ';' Exp3 ')' Stmt

Actions

Any rule may have an action attached to it (at the end of the rule).  There are 5 types of actions:

=>    Call a parse-action.
+>    Make a node in the AST.
*>    Make a node and call a node-action (during AST traversal).
=+>   Call a parse-action and make a node in the AST.
=*>   Call a parse-action, make a node and call a node-action.

=> Parse action

Parse actions are done by the parser when a rule is recognized.  When the last symbol in a rule is accepted, the parser makes a reduction and calls the function indicated by the '=>' operator.  Here is the syntax:

'=>' FunctionName '(' [Arg1 [, Arg2]] ')'

Arg1 must be an integer which points at a symbol in the rule (i.e. 1, 2, 3, etc).

Arg2 may be one of the following:

Defined Constant   (e.g.  INTEGER)
Terminal symbol   (e.g.  <identifier>)
Semantic symbol   (e.g.  {typedef})
Integer   (e.g.  100)

+> Make a node in the AST

Make-a-node operation occurs when the parser makes a reduction of a rule.  At this time, the parser creates a new parent node and attaches any children nodes to it as may occur in the rule being recognized.  Here is an example of how to do expressions, taken from the calc.grm:

Exp   -> Primary
      -> Exp '+' Exp	   +> add
      -> Exp '-' Exp	   +> sub
      -> Exp '*' Exp	   +> mul
      -> Exp '/' Exp	   +> div

*> Make a node and call a node action

A Make-a-node operation occurs when the parser makes a reduction of a rule.  At this time, the parser creates a new node and attaches the children nodes found in the right side of the rule.  Here is the syntax:

*> NodeName
*> NodeName '(' ')'
*> NodeName '(' [Arg1 [, Arg2]] ')'

Arg1 must be an integer which points at a symbol in the rule (i.e. 1, 2, 3, etc).

Arg2 must be one of the following:

Semantic symbol   (e.g.  {typedef})

=+> Call a parse action and make a node

This is the combination of two operations, which is more concise than having a => and a +> for the rule.  In this case (=+>) the parse action name will be the same as the node name. 

=*> Call a parse action , make a node, and call a node action

This is the combination of three operations.  Some situations may require this kind of thing.  Usually, the parse action name will be the same as the node name and the node-action name will be the same as the node name. 

Assignment  -> Target '=' Exp ';'   =*> assignment_()~  

In the above case we will need a parse action called 'assignment', and a node action called 'assignment'. These must be supplied in the "calc_actions.cpp" source file.  See sample project 'calc'.


Section 7.  Useful Examples

Example 1.  Delayed Typing Of <identifier>s

There are some occasions in which it is very useful to delay the typing of symbols until the whole rule is recognized.  This allows one to avoid conflicts in the grammar or avoid writing code that does a lookahead to resolve a conflict at parsing time.  Here is a classical example often presented as a reason why you need infinite lookahead to resolve this ambiguity in the C language:

ExternalDef  -> FuncDecl
             -> FuncDef

FuncDecl     -> Type FuncDeclName '(' Arg/','+  ')' ';'
FuncDef      -> Type FuncDefName  '(' Arg/','+  ')' '{' Stmts '}'

FuncDeclName -> <identifier>             => funcdecl_(1)
FuncDefName  -> <identifier>             => funcdef_(1)

In the above grammar there is a reduce-reduce conflict.  At the input symbol '(' the parser cannot decide whether <identifier> is a FuncDeclName or a FuncDefName.  In fact, it cannot make this decision until it sees the ';' or the '{'.  We can get rid of the ambiguity (conflict) by re-writing the grammar this way:

ExternalDef  -> FuncDecl
             -> FuncDef

FuncDecl     -> Type <identifier> '(' Arg/','+  ')' ';'
FuncDef      -> Type <identifier> '(' Arg/','+  ')' '{' Stmts '}'

In this new grammar, there is no conflict, however, there is no parse actions for either case.  We must delay the action and perform it later during traversal of the AST.  This is not a problem, however, it may seem strange to you.  You can learn to do it this way.  I think it's better than doing a lookahead for the ';' or the '{'.  We need to incorporate the appropriate node actions in the grammar as follows:

ExternalDef  -> FuncDecl
             -> FuncDef

FuncDecl  -> Type <identifier> '(' Arg/','+  ')' ';'            *> funcdecl_(2)
FuncDef   -> Type <identifier> '(' Arg/','+  ')' '{' Stmts '}'  *> funcdef_(2)

Now, we will have two different nodes in the AST, one called 'funcdecl_' and another called, 'funcdef_'.  The <identifier> will be attached to the node via its symbol-table-index.  At this time we can do appropriate processing of the names.  This method avoids the lookahead problem.

(c) Copyright Paul B Mann 2020.  All rights reserved.