The Tower Bridge in Sacramento, California Parsing Concepts
Grammars and Backus-Naur Form
Main
Latest News
Getting Started
Screen Shots
Download
Documentation
Contributors
Contact
About GOLD
How It Works
FAQ
Why Use GOLD?
Comparison
Revision History
Freeware License
More ...
Articles
What is a Parser?
Backus-Naur Form
DFA Lexer
LALR Parsing
Glossary
Links
More ...


Context Free Grammars

Grammars provide rules that specify the structure of languages, independently from the actual meaning of the content. Grammars are classified according to the complexity of the structure they describe. The class of context free grammars (CFG) is the most common one use to describe the syntax of programming languages. In this class, the category a token belongs to (e.g. reserved words, identifiers, etc.) is what matters rather than the specific token (e.g. the identifier xyz).

In addition, the formatting of the program (the content of whitespace) and the actual text of identifiers does not affect the syntax of the grammar. This is very important in parsing technology. Grammars that are not context free cannot be parsed by the LR, LALR or LL parsing algorithms.

Backus-Naur Form

Backus-Naur Form, or BNF for short, is a notation used to describe context free grammars. The notation breaks down the grammar into a series of rules - which are used to describe how the programming languages tokens form different logical units

The actual reserved words and recognized symbol categories in the grammar represent "terminals". Usually, terminals are left without special formatting or are delimited by single or double quotes. Examples include: if, while, '=' and identifier.

In Backus-Naur Form, rules are represented with a "nonterminal" - which are structure names. Typically, nonterminals are delimited by angle-brackets, but this is not always the case. Examples include <statement> and <exp>. Both terminals and nonterminals are referred to generically as "symbols". Each nonterminal is defined using a series of one or more rules (also called productions). They have the following format:

N ::= s

where N is a nonterminal and s is a series of zero or more terminals and nonterminals.  Different alternatives for rules can be specified in Backus-Naur Form. For readability, often productions are grouped together and  separated by a “pipe” symbol which is read as the word “or”.

In summary , there are slight variations in use, but the notation has the following properties.

  • A rule consists of one or more productions.
  • The production starts with a single nonterminal, which is the name of the rule being defined
  • This nonterminal is followed by a ::= symbol which means “as defined as”. The  ::= symbol is often used interchangeably with the symbol. They both have the same meaning.
  • The symbol is followed by a series of terminals and nonterminals.

The following chart identifies the various parts of a rule definition.

 

For example, the following defines a rule called <Value> which can contain either an Identifier terminal or the contents of another rule called <Literal>

<Value> ::= Identifier | <Literal>
<Literal> ::= Number | String

The <Literal> rule can contain either a  Number  or String   terminal. As a result of this definition, a <Value> can contain an Identifier, Number or String.

Rules can also be recursively defined. The following rule defines a series of one or more Identifiers.

<Identifiers> ::= Identifier <Identifiers>
                | Identifier

Extended BNF

There is another version of BNF called Extended BNF, or EBNF (ISO/IEC 14977), for short. This variant was originally developed by Niklaus Wirth to define the syntax for the Pascal Programming Language. The notation was designed to simplify the notation of BNF by allowing the developer to use special notation for defining lists and optional sets of symbols.

Variations between different versions of EBNF exist, but most use similar notation. Square brackets "[ ... ]" are used to denote optional elements of a rule. Elements of a rule can also be grouped together using braces "{ ... }" which denotes a repetition of zero to infinity. Symbols can also be grouped using parenthesis "( ... )" and followed by a Kleene closure. In this case, the semantics are identical to those used in Regular Expressions.

This format, while powerful, creates a number of implied rules. For instance, if the programmer was to define a rule with an optional clause, the system would have two distinct forms of the rule - the one with the clause and one without. This is also the case with lists and other enhanced features. For instance, the If-then-else statement could be defined as:

<If Stm> ::= IF <Expression> THEN <Stms> [ ELSE <Stms> ]

This would create the following rules:

<If Stm> ::= IF <Expression> THEN <Stms>
           | IF <Expression> THEN <Stms> ELSE <Stms>