The Tower Bridge in Sacramento, California Getting Started
Terminals & Regular Expressions
Main
Latest News
Getting Started
Screen Shots
Download
Documentation
Contributors
Contact
About GOLD
How It Works
FAQ
Why Use GOLD?
Comparison
Revision History
Freeware License
More ...
Articles
What is a Parser?
Backus-Naur Form
DFA Lexer
LALR Parsing
Glossary
Links
More ...


What is a Terminal?

When talking about grammars, a terminal is any symbol, identifier, reserved word, etc... that describes a piece of data found in the  language. Essentially, anything that can be recognized as a meaningful piece of lexical information is a terminal.

What is considered a terminal varies greatly between programming languages. In most cases, there is significant overlap. For instance, the following piece of text would specify a simple mathematical expression in most languages:

1 + 2 * 3

In this example, the terminals are '1', '+', '2','*', and '3'. Often, the format of terminal is specified using a regular expression. This is true of GOLD and many other parsing systems.

Regular Expressions

Introduction

A regular expression is a simple, yet powerful, notation that is used to represent simple patterns. They are used extensively in programming language theory. In particular, regular expressions are used to describe the "terminals" of a programming language.

The notation is fairly easy to master. Each expression consists of a series of characters and sub-expressions delimited using parenthesis '(' and ')'. Any of these items, can be followed by a special character that specifies the number that they can appear in a sequence. 

The vertical-bar character '|' is used to denote alternate expressions.

Special Symbols
Zero or more
One or more
Optional


ab*
The asterisk * denotes zero or more of the selected characters. For instance, the example to the left defines a regular expression that accepts a letter 'a' followed by zero or more letter 'b's. The text "a", "ab", "abb", "abbb", etc... would be accepted.

ab+
The plus symbol + denotes one or more of the selected characters. Unlike the last example, because the plus requires one or more 'b's, a single "a" will not be accepted. In other words, the text "ab", "abb", "abbb", etc... would be accepted.

ab?
Finally, the question mark ? is used to indicate the selected characters are optional. In other words, they can appear zero or 1 times. In this example, only the text "a" and "ab" would be accepted.

Notation Variations

Many  parsing systems have expanded the notation to include set literals and sometimes named sets. In the case of Lex, literal sets of characters are delimited using the square brackets '[' and ']' and named sets are delimited by the braces '{' and '}'. For instance, the text "[abcde]" denotes a set of characters consisting of the first five letters of the alphabet while the text "{abc}" refers to a set named "abc". This type of notation permits a short-cut notation for regular expressions. The expression (a|b|c)+ can be defined as [abc]+. 

It should be noted that standard set notation uses curly brackets to denote a set, not the name of a set. As a result, the notation used by most parsing systems differs quite a bit from the official notation.

Examples

Defintion Possible Terminals
Example1 = abc* ab, abc, abcc, abccc, abcccc, ...
Example2 = ab?c abc, ac
Example3 = a|b|c a, b, c
Example4 = a[12]*b ab, a1b, a2b, a12b, a21b, a22b, a111b, ...
Example5 = '*'+ *, **, ***, ****, ...
Example6 = {Letter}+ cat, dog, Sacramento, ...
ListFunction = c[ad]+r car, cdr, caar, cadr, cdar, cddr, caaar, ...
ListFunction = c(a|d)+r The same as the above using a different, yet equivalent, regular expression.

 

Next: Syntax & Backus-Naur Form