Example: Groups

Non-Noise Groups

The group approach is designed to allow the developer to create any number of groups. The most common, naturally, will be the block comment. However, there will be situations where the developer will want to create a group that will be recognized by the parser as a regular terminal.

The terminals 'Comment' and 'Whitespace' are automatically treated as noise and are therefore ignored by the parser. These are the only two; any other group creates a regular terminal. For instance, a developer might want to create a language that allows a variable to be assigned either a string literal or HTML code.

So, if the developer wants to create an HTML block in code, they can specify:

HTML Start = '<html>'
HTML End   = '</html>'

This will create the HTML Start and HTML End symbols in the symbol table. The system will also create the HTML terminal, which can be used directly in the grammar. So, in the grammar, the developer can specify the following definitions:

<Assign>  ::= Identifier '=' <Value>

<Value>   ::= StringLiteral
           |  HTML

In this case, the content of the HTML terminal would probably not be tokenized, since the terminal syntax of the grammar probably differs greatly from HTML. So, the group can be defined as unnested ('Nesting = None') and as advancing one character at a time ('Advance = Character') using the attributes. Note: this is not set in stone; the developer may want to use different attributes.

HTML Block @= { Nesting = None, Advance = Character }

The grammar could accept the following text:

name = "String Literal"

page =
<html>
   <head>
     <title>Some page</title>
   </head>
   <body>
     This is a tad easier than concatenating a series of strings!
   </body>
</html>
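
The effect of 'Nesting = None' and 'Advance = Character' can be pictured with a small Python sketch. This is only an illustration of the idea, not the engine's actual implementation; the function name read_group and its parameters are hypothetical. Once the start symbol is seen, the reader moves forward one character at a time, without tokenizing, until the end symbol appears, and returns everything in between as a single lexeme.

```python
def read_group(text, pos, start, end):
    """Read an unnested, character-advance group beginning at pos.

    Returns (lexeme, next_pos), or None if no group starts at pos.
    Hypothetical helper for illustration only.
    """
    if not text.startswith(start, pos):
        return None
    i = pos + len(start)
    while i < len(text):
        if text.startswith(end, i):
            stop = i + len(end)
            return text[pos:stop], stop
        i += 1  # Advance = Character: skip one character, no tokenizing
    raise ValueError("group was never closed")


source = "page = <html><b>hi</b></html>;"
result = read_group(source, source.index("<html>"), "<html>", "</html>")
```

Here result holds the whole HTML block as one lexeme, plus the position of the ';' that follows it. The tags inside the block are never handed to the tokenizer.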

Real-World Examples

ANSI-C (and its children)

ANSI-C comments are pretty basic. They cannot be nested, and they advance only a character at a time (untokenized). This format is used by all successors of ANSI-C, such as C++, Java and C#. Strangely, line comments are not part of the ANSI-C language definition, but compilers have rarely failed to recognize them.

Comment Block @= { Nesting = None, Advance = Character }

Comment Start   = '/*'
Comment End     = '*/'
Comment Line    = '//'
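
To see what the nesting attribute changes, the following Python sketch compares an unnested reading with a nested one on the same text. It is a simplified model under stated assumptions, not the engine's code: the helper group_length is hypothetical, and "nested" here means only that the group may contain itself.

```python
def group_length(text, start, end, nested):
    """Length of the group at the beginning of text, or -1 if it never closes.

    Hypothetical helper: with nested=False the first end symbol closes the
    group; with nested=True every inner start symbol must be closed as well.
    """
    if not text.startswith(start):
        raise ValueError("text does not begin with the start symbol")
    depth = 1
    i = len(start)
    while i < len(text):
        if nested and text.startswith(start, i):
            depth += 1
            i += len(start)
        elif text.startswith(end, i):
            depth -= 1
            i += len(end)
            if depth == 0:
                return i
        else:
            i += 1  # advance a single character
    return -1


sample = "/* outer /* inner */ tail */"
unnested = sample[:group_length(sample, "/*", "*/", nested=False)]
nested = sample[:group_length(sample, "/*", "*/", nested=True)]
```

With the unnested reading, the comment ends at the first '*/', which is the ANSI-C behavior described above; the nested reading consumes the whole sample.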

Pascal

The Pascal programming language has two different block comments. The original version of the language used (* to start a comment and *) to end one. Later, the curly brackets { and } were added. Both are valid in Pascal programs. In addition, the two are synonymous, meaning a comment can start with (* and end with }, and vice versa.

In this case, a single comment group can be defined. The regular expressions can be written so that Comment Start and Comment End each accept either notation. Normally, Pascal comments cannot be nested, but this varies by compiler. Group definitions specify a terminal name, so extra definitions are necessary.

Comment Block  @= { Nesting = All, Advance = Character }

StartTerminal  = '{' | '(*'
EndTerminal    = '}' | '*)'

Comment Start  = StartTerminal
Comment End    = EndTerminal

If the developer wants the start and end of the comment to "match", they can define a second group. Only 'Comment' and 'Whitespace' are flagged as noise by default, but any terminal can be made noise by assigning its attributes. In the following example, the grammar defines 'Comment2'. The name is really up to the developer. They could just as easily have used 'CommentAlt', 'OtherFormat', etc.

To set this group to noise, the developer specifies 'Noise' as its type in the attributes. As a result, both Comment and Comment2 will be ignored by the parser. The developer could also manually add the 'Noise' attribute to 'Comment', but this is not necessary.

Comment Block  @= { Nesting = All, Advance = Character }
Comment2 Block @= { Nesting = All, Advance = Character }

Comment2 @= { Type = Noise }

Comment Start   = '{'
Comment End     = '}'

Comment2 Start  = '(*'
Comment2 End    = '*)'
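
The practical difference between the single mixed group and the two matched groups can be sketched in Python. This is a rough model, not the engine's behavior in full: the function strip_groups and the (starts, ends) tuple layout are assumptions made for the illustration, and nesting is ignored to keep the sketch short.

```python
def strip_groups(text, groups):
    """Remove every noise group from text.

    groups is a list of (starts, ends) tuples; a group opened by any of its
    starts is closed by any of its own ends. Hypothetical helper.
    """
    out = []
    i = 0
    while i < len(text):
        for starts, ends in groups:
            opener = next((s for s in starts if text.startswith(s, i)), None)
            if opener is not None:
                i += len(opener)
                # Advance one character at a time until an end of THIS group.
                while i < len(text) and not any(text.startswith(e, i) for e in ends):
                    i += 1
                closer = next((e for e in ends if text.startswith(e, i)), "")
                i += len(closer)
                break
        else:
            out.append(text[i])  # ordinary character, keep it
            i += 1
    return "".join(out)


single = [(("{", "(*"), ("}", "*)"))]              # one group, mixed delimiters
matched = [(("{",), ("}",)), (("(*",), ("*)",))]   # Comment and Comment2
source = "x := 1; (* note } y := 2; *) z := 3;"

mixed_result = strip_groups(source, single)
matched_result = strip_groups(source, matched)
```

With the single group, the '}' closes the comment opened by '(*', so the second assignment survives and a stray '*)' is left in the text. With the two matched groups, the comment runs to '*)' and the second assignment is treated as part of it.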