DFA: RetrieveToken Function

 

function RetrieveToken () returns Token

This function implements the DFA Lexer / Tokenizer.

Parameters:

Stream which contains the "Source" text to be parsed.

Returns

Token

This function implements the DFA algorithm that is used by the scanner (tokenizer). The algorithm starts at the initial DFA state and then attempts to find the longest string of characters that can be accepted as one of the language's terminals.

Each DFA state contains a series of edges that contain a character set. The algorithm essentially looks at all the edges of the current state and then advances along the edge if the current character is in the set.

Because the algorithm needs to "look ahead" at characters, the stream needs to support the ability to load characters without discarding them. This can be implemented with a buffered stream.

if Source has no more characters

Create a new Token.
Set the Type property of Token = End-Of-File.

else

Set State = Initial-DFA-State.
Set Done = False.
Set Length = 1.

loop until Done

Part 1:
If the current state accepts a terminal, keep track of the cursor position and the state index. This will be used later.

if the State accepts a termimal

Set Accept-State = State.
Set Accept-Length = Length.

end if

Part 2:
Look through the edges of the current state and advance if an edge is found that contains the character at Lookahead(Position). When no edge is found, a token is created and the loop is exited.

If no Accept State was set then an error token is created. A single character is read from the Source. This allows error recovery by discarding the character that caused the error.

If an Edge in State exists that contains the character Lookahead(Length) in Source

Set State = the Target-State property of the Edge.
Increment the Length.

else

if the Accept-State was set

Create a new Token.
Set the Parent-Symbol property of Token = symbol in Accept-State.
Set the Data property of Token = read Length characters from Source.

else

Create a new Token.
Set the Parent-Symbol property of Token = Error.
Set the Data property of Token = read one character from Source.

end if

Set Done = True.

end if

next loop

end if

If you plan to maintain a counter of the current line number, you can read through the new token and increment the counter for each carriage-return or line-feed you encounter. Please note that Windows uses a CR+LF, Unix uses a single CR, and Macintosh uses a single LF.

Return the Token.

end function