As a running example we will specify, throughout the manual, a front-end for a compiler or an interpreter for a small functional language Lambda. In this section we will define the scanner for this language; in Section 3.1 we build a parser on top of this scanner.
Program 2.1 shows the specification of the sample scanner we will consider in this section. In the following we will examine this example line by line.
Class Descriptors
At first glance the scanner specification closely resembles a class definition with some extra elements, introduced by the keyword scanner instead of class. This is intentional, since it will ultimately be replaced by a class. This is why all descriptors allowed in a class definition are also allowed at the beginning of a scanner specification. Consider the from, attr and meth constructs used in lines 2 to 10.
Lexical Abbreviations
The scanner-specific declarations begin at line 12. Two kinds of definition can be introduced by the keyword lex: either a lexical abbreviation, as seen in lines 12 to 15, or a lexical rule, as found from line 17 to the end of the specification. A lexical abbreviation lex I = <R> end associates an identifier I with a given regular expression R. Occurrences of {I} in other regular expressions are then replaced by (R).
Note that regular expressions use the same syntax as regular expressions in flex [Pax95], with a few exceptions (detailed in Section 2.2.1). Furthermore, we must either enclose them in angle brackets or give them as Oz strings. (The latter proves useful when the angle-bracketed version confuses Emacs' fontification mode, but is a bit harder to read, since more characters must be escaped.)
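For example, assuming the Oz string form is accepted in lexical abbreviations just as it is in rules, the digit abbreviation of Program 2.1 could equally be given as a string:

lex digit = "[0-9]" end   % same effect as: lex digit = <[0-9]> end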
The example defines four lexical abbreviations: digit stands for a decimal digit; letter for an uppercase or lowercase letter; id defines the syntax of identifiers to consist of a letter followed by an arbitrary sequence of letters and digits; and finally, int defines the syntax of positive decimal integers as a nonempty sequence of digits.
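To make the substitution concrete, here is how {id} unfolds once the abbreviations above are spliced in (the expansion is only an illustration of what Gump does internally; it is not additional specification text):

{id}
--> ({letter}({letter}|{digit})*)
--> (([A-Za-z])(([A-Za-z])|([0-9]))*)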
Lexical Rules
Lexical rules of the form lex <R> S end are more interesting, since together they constitute the actual scanner specification. Upon a match of a prefix of the input character stream with the regular expression R, the statement S is executed as a method body (i.e., self may be accessed and modified). Two methods are provided by the mixin class GumpScanner.'class' (which the scanner inherits from in line 2) to append tokens to the token stream: putToken1, which appends a token of a given class without a value (unit being used instead), and putToken, which allows a specific token value to be provided. Token classes may be represented by arbitrary Oz values, but the parser generator in Chapter 3 expects them to be atoms. In lines 18 and 21 you can see how constants are used as token classes; in line 33 the token class is computed from the lexeme.
Accessing the Lexeme
The lexeme itself may be accessed in several ways. The method getAtom returns the lexeme as an atom, which is the representation for identifier token values chosen in line 25. The method getString returns the lexeme as a string, as in line 28, where it is subsequently converted to an integer.
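As a sketch tying these together (this variant rule is not part of Program 2.1; it merely adapts the identifier rule of that program), the lexeme obtained via getAtom can be combined with other scanner state, here the current line number, into a richer token value for putToken:

lex <{id}> A in
   GumpScanner.'class', getAtom(?A)
   % hypothetical variant: pair the lexeme with the current line number as the token value
   GumpScanner.'class', putToken('id' A#@LineNumber)
end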
The remaining lexical rules are easily explained. Lines 36 and 37 respectively describe how whitespace and comments are to be ignored. This is done by calling neither putToken1 nor putToken. (Note that an action can also invoke them several times to append multiple tokens to the token stream, just as it may choose not to invoke them at all, either to simply ignore the lexeme or to only produce side effects.) The rule in line 38 ignores any matched newlines, but updates the line counter attribute LineNumber as it does so. The rule in line 41 reports any remaining unmatched characters in the input as lexical errors and returns the token 'error', which the parser can recognize as an erroneous token.
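As a purely illustrative sketch of an action appending several tokens (the lexeme and rule below are made up; Lambda has no such construct), a single lexeme could be split into two tokens:

lex <":=">
   % hypothetical rule: one lexeme, two tokens appended to the token stream
   GumpScanner.'class', putToken1(':')
   GumpScanner.'class', putToken1('=')
end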
End-of-File Rules
The final rule, in line 46, has the special syntax <<EOF>> (it might also have been written as "<<EOF>>") and only matches the end of the character stream. It returns the token 'EOF', which the parser can recognize as the end of input. Note that the action might just as well open another file to read from.
More information about acceptable sets of regular expressions in scanner specifications, conflict resolution and grouping into lexical modes is given in Section 2.2.1.
declare
scanner LambdaScanner from GumpScanner.'class'
   attr LineNumber
   meth init()
      GumpScanner.'class', init()
      LineNumber <- 1
   end
   meth getLineNumber($)
      @LineNumber
   end

   lex digit = <[0-9]> end
   lex letter = <[A-Za-z]> end
   lex id = <{letter}({letter}|{digit})*> end
   lex int = <{digit}+> end

   lex <define>
      GumpScanner.'class', putToken1('define')
   end
   lex <lambda>
      GumpScanner.'class', putToken1('lambda')
   end
   lex <{id}> A in
      GumpScanner.'class', getAtom(?A)
      GumpScanner.'class', putToken('id' A)
   end
   lex <{int}> S in
      GumpScanner.'class', getString(?S)
      GumpScanner.'class', putToken('int' {String.toInt S})
   end
   lex <"."|"("|")"|"="|";"> A in
      GumpScanner.'class', getAtom(?A)
      GumpScanner.'class', putToken1(A)
   end

   lex <[ \t]> skip end
   lex <"%".*> skip end
   lex <\n>
      LineNumber <- @LineNumber + 1
   end
   lex <.>
      {System.showInfo 'line '#@LineNumber#': unrecognized character'}
      GumpScanner.'class', putToken1('error')
   end

   lex <<EOF>>
      GumpScanner.'class', putToken1('EOF')
   end
end
Now that we have finished writing our specification, we want to translate it into an Oz class definition that implements our scanner. For this, we issue the compiler directive
\switch +gump
whereupon the compiler will accept Gump specifications.
Running Gump
Save the above specification in a file LambdaScanner.ozg. The extension .ozg indicates that this file contains Oz code with additional Gump definitions, so that Emacs will fontify Gump definitions correctly. Feeding
\insert LambdaScanner.ozg
will process this file. Switch to the Compiler buffer (via C-c C-c) to watch Gump's status messages and any errors occurring during the translation.
Output Files
When the translation is finished, you will notice several new files in the current working directory. These are named after your scanner specification: if your scanner is called S, you will find files S.l, S.C, S.o and S.so. The first three are intermediate results (respectively the input file for flex, the flex-generated C++ file and the object code produced by the C++ compiler); the last one is the resulting dynamic library used by the generated scanner.
Implementation Limitation
Note that due to limitations of dynamic linking, a scanner may only be loaded once into the system. When interactively developing a scanner, this means that changes you make to the set and order of the regular expressions will not be picked up consistently. You should thus halt and restart Mozart each time you change the regular expressions. See Section 2.2.2 for a workaround for this limitation.
Program 2.2 shows a sample program running our generated scanner.
The generated LambdaScanner class is instantiated as MyScanner. We have to call the method init() first to initialize the internal structures of GumpScanner.'class'.
Requesting Tokens
The procedure GetTokens repeatedly invokes the GumpScanner.'class' method

getToken(?X ?Y)

which returns the next token's class in X and its value in Y and removes the token from the token stream. GetTokens exits when the end of the token stream is reached, which is recognized by the token class 'EOF'.
Providing Inputs
To actually start scanning we have to provide an input character stream. This is done via one of the methods

scanFile(+FileName)
scanVirtualString(+V)

Each of these pushes the currently used buffer (if any) onto an internal stack of buffers and builds a new buffer from the given source. Each time the end of a buffer is reached, the <<EOF>> rule is matched. The action may then pop a buffer and continue scanning the next-outer buffer where it left off, using the closeBuffer method described in Section 2.2.3.
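As a small usage sketch (reusing MyScanner and GetTokens from Program 2.2; the input text is made up), scanning an in-memory virtual string instead of a file would look like this:

{MyScanner scanVirtualString("define c = 17;")}  % pushes a new buffer for the given text
{GetTokens}                                      % consume tokens until 'EOF'

As with file input, the scanner should still be sent close() when it is no longer needed, as described below.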
Closing Scanners
When a scanner is not used anymore, it should be sent the message
close()
so that it can close any open files and release any allocated buffers. (This is even necessary when scanning virtual strings due to the underlying implementation in C++.)
The following is a sample input for the scanner. The above example expects this to be placed in the file Lambda.in in the current directory:
% some input to test the class LambdaScanner
define f = lambda y.lambda z.(add y z);
define c = 17;
f c 7;
((f) c) 7
\switch +gump
\insert gump/examples/LambdaScanner.ozg
local
   MyScanner = {New LambdaScanner init()}
   proc {GetTokens} T V in
      {MyScanner getToken(?T ?V)}
      case T of 'EOF' then
         {System.showInfo 'End of file reached.'}
      else
         {System.show T#V}
         {GetTokens}
      end
   end
in
   {MyScanner scanFile('Lambda.in')}
   {GetTokens}
   {MyScanner close()}
end