


Regular Expressions

A regular expression is a simple yet powerful notation for describing patterns in text. Regular expressions are used extensively in programming language theory, in particular to describe the "terminals" of a programming language. The term "terminal" refers to the reserved words, symbols, literals, identifiers, and so on that are the basic components of a programming language.

When a program is analyzed, the tokenizer chops the text into logical units called "tokens". Together the tokens carry the same information as the original program, except that the tokenizer can discard text such as comments. A terminal represents a classification of information, while a token contains the actual information: the category of a token is its associated terminal. Regular expressions are used to describe these kinds of patterns.

The notation consists of expressions constructed from a series of characters. Subexpressions are delimited by parentheses, '(' and ')'. The vertical bar character '|' denotes alternate expressions. Any of these items can be followed by a special character (for example, '+' for one or more) that specifies how many times the item can appear in sequence.
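The relationship between terminals and tokens can be sketched in a few lines of Python using the standard `re` module. This is only an illustration under stated assumptions: the terminal names, their patterns, and the `tokenize` function are hypothetical and describe a toy language, not any particular tool's tokenizer.

```python
import re

# Hypothetical terminals for a toy language: (terminal name, regular expression).
# Order matters: earlier entries win when several patterns match at one position.
TERMINALS = [
    ("Comment",    r"#[^\n]*"),
    ("Whitespace", r"\s+"),
    ("Keyword",    r"(?:if|then|else)\b"),
    ("Identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("Number",     r"[0-9]+"),
    ("Symbol",     r"[=+\-*/()<>]"),
]

# Terminals the tokenizer recognizes but does not report.
IGNORED = {"Comment", "Whitespace"}

# Combine all terminals into one alternation of named groups.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TERMINALS))


def tokenize(text):
    """Return a list of (terminal, lexeme) pairs for the given text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        # lastgroup is the name of the terminal whose pattern matched.
        if match.lastgroup not in IGNORED:
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for terminal, lexeme in tokenize("if x1 > 10 then y = x1 + 2  # bump"):
        print(terminal, lexeme)
```

Each token pairs the actual text (the lexeme) with its category (the terminal), and the comment and whitespace are recognized but dropped.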
Many scanner (lexer) generators and parsing systems have extended the notation to include set literals and, in some cases, named sets. In Lex, literal sets of characters are delimited by square brackets, '[' and ']', and named sets are delimited by braces, '{' and '}'. For instance, the text "[abcde]" denotes the set of characters consisting of the first five letters of the alphabet, while the text "{abc}" refers to a set named "abc". Sets provide a shorthand for regular expressions: the expression (a|b|c)+ can be written as [abc]+.

Examples
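As a small check of the set shorthand, the following Python sketch compares the alternation (a|b|c)+ with the set [abc]+ and contrasts both with the sequence (abc)+, which matches only repetitions of the exact string "abc". Python's `re` module has no named sets, so the `NAMED_SETS` table and the `expand` helper are hypothetical stand-ins that merely imitate the "{abc}" notation.

```python
import re

alternation = re.compile(r"(a|b|c)+")  # one or more of a, b, or c
char_set = re.compile(r"[abc]+")       # the same language, written as a set
sequence = re.compile(r"(abc)+")       # one or more copies of the string "abc"

# The alternation and the set accept exactly the same strings.
for sample in ["a", "abc", "cabbac", "", "abd", "xyz"]:
    assert bool(alternation.fullmatch(sample)) == bool(char_set.fullmatch(sample))

# The sequence is a different, stricter pattern.
assert sequence.fullmatch("abcabc") is not None
assert sequence.fullmatch("cab") is None
assert char_set.fullmatch("cab") is not None

# Hypothetical emulation of named sets: replace {name} with a set literal.
NAMED_SETS = {"abc": "[abc]"}


def expand(pattern):
    """Replace each {name} reference with the set registered under that name."""
    return re.sub(r"\{([A-Za-z_]\w*)\}", lambda m: NAMED_SETS[m.group(1)], pattern)


assert expand("{abc}+") == "[abc]+"
```

The name pattern in `expand` deliberately requires a leading letter or underscore so that counted repetitions such as {2} are left alone.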
