This is a natural language tokenizer based on Gump (which is itself based on Flex). It improves upon the simple tokenizer and represents a much better approach to the tokenization of natural language. Among other things, it does not need to be paired with a sentence splitter, since it handles sentence splitting all by itself. It is, however, somewhat slower and more heavyweight than the simple tokenizer.
Although the tokenizer in this package is set up for English, it should be fairly straightforward to port to other (similar) languages. However, note that a tokenizer for natural language often needs to be tuned, not only to a particular language, but also to the kind of texts on which it is going to be used.
In the present version of the tokenizer, only four token classes are distinguished:
p   Paragraph delimiter (ends a paragraph)
s   Sentence delimiter (ends a sentence)
w   'Word' (includes ordinary words, but also abbreviations, numbers, etc.)
c   Other (separators, etc.)
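For instance, given the input "It works. Really!", one would expect roughly the following classifications (an illustration only; the exact behavior is determined by the rules in 'EnglishTokenizer.oz'):

It       w
works    w
.        s
Really   w
!        s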
As can be seen from the actual Gump definitions (in the source file 'EnglishTokenizer.oz'), it would be possible to use a more fine-grained set of classes (recognizing e.g. abbreviations and dates), but the risk of misclassification would increase greatly.
The tokenizer separates contractions into multiple tokens: it splits the word "don't" into the two tokens "do" and "n't", where "n't" is treated as a special form of "not". Similarly, the word "John's" is treated as the two tokens "John" and "'s". This is done because it is the format the Brill tagger expects.
Download the package, and invoke ozmake in a shell as follows:
ozmake --install --package=lager-gump-tokenizer.pkg
By default, all files of the package are installed in the user's ~/.oz directory tree. In particular, all modules are installed in the user's private cache.
Tokenizer.'class' defines functionality, inherited from Gump, that is available to users of the generated tokenizer. Only part of what is available is listed below; refer to the Gump manual for more information.
meth init()
   Initializes the tokenizer. This must be called before any other method of this class.

meth getToken(?X ?Y)
   Returns the class of the next token in X and its value in Y. Both X and Y are atoms.

meth scanFile(+F)
   Opens the file F and tokenizes it. If the file does not exist, the exception gump(fileNotFound F) with the filename in F is raised.

meth scanVirtualString(+V)
   Like scanFile, but scans a virtual string V.

meth close()
   Closes the tokenizer. Before it can be used again, init() must be called again.

This is how we (in the OPI) write a function GetSentence that will retrieve one sentence (a list of words) from the tokenizer each time it is called:
declare
%% Link functor, get module
[Tokenizer] = {Module.link ['x-ozlib://lager/gump-tokenizer/EnglishTokenizer.ozf']}
%% Create and initialize Tokenizer object
MyTokenizer = {New Tokenizer.'class' init()}
%% Tokenize file
{MyTokenizer scanFile('test.txt')}
fun {GetSentence} T V in
   {MyTokenizer getToken(?T ?V)}
   case T
   of 'EOF' then nil         %% no more input
   [] p then nil             %% paragraph break: sentence is finished
   [] s then [V]             %% sentence delimiter ends the sentence
   [] w then V|{GetSentence} %% word: collect and continue
   [] c then V|{GetSentence} %% other: collect and continue
   end
end
%% Each time you feed this, the Inspector will
%% show a different sentence from 'test.txt',
%% and 'nil' when there are no sentences left
{Inspect {GetSentence}}
/* Feed this to close the tokenizer when you're done.
{MyTokenizer close()}
*/
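To tokenize text that is already in memory, scanVirtualString can be used instead of scanFile. The sketch below assumes the module linked above; the sample text and variable names are illustrative:

declare
MyStringTokenizer = {New Tokenizer.'class' init()}
{MyStringTokenizer scanVirtualString("Hello world. How are you?")}
local T V in
   %% Class and value of the first token; presumably w and 'Hello'
   {MyStringTokenizer getToken(?T ?V)}
   {Inspect T#V}
end
{MyStringTokenizer close()}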
The distribution also includes a stand-alone application which prints each token and its class on a separate line. It can be invoked in the following way on a text file:
tokenize --in=test.txt
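Assuming test.txt contains just "Don't stop!", the output might look roughly like this (one token and its class per line; the exact layout may differ):

Do      w
n't     w
!       s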