Module Tokenizer.ozf exports a class that implements a simple tokenizer for natural language. Given a string, it returns a list of strings (this can be changed, e.g. to a list of atoms, by subclassing), where each string is considered a token. The tokenizer has a reasonable default behaviour for most European languages (well, for English and Swedish at least...), and it can be tailored to specific languages and applications by subclassing. For example, the tokenizer for English separates contractions into multiple tokens: it splits the word "don't" into the two tokens "do" and "n't", where "n't" is treated as a special form of "not", and it treats the word "John's" as the two tokens "John" and "'s". Tokens are split this way because this is the format the Brill tagger expects.
It is important to emphasize that this is a simple program. It was written to be used by the Brill tagger, but since it is also independently useful, I decided to make it available separately.
Download the package, and invoke ozmake in a shell as follows:
ozmake --install --package=lager-simple-tokenizer.pkg
By default, all files of the package are installed in the user's ~/.oz
directory tree. In particular, all modules are installed in the user's private
cache.
Module Tokenizer.ozf exports, on feature 'class', a class definition for a tokenizer for natural language. It is up to each application to specialize the methods for individual natural languages.
The tokenizer understands the following basic methods:

init()
    Initializes the tokenizer.
tokenize(String ?Tokens)
    Tokens gets bound to the tokens found in String.
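For illustration, a minimal functor that creates a tokenizer and prints each token on its own line might look roughly as follows. The import URL 'Tokenizer.ozf' is only a placeholder; point it at wherever the module actually ends up after installation.

functor
import
   Tokenizer at 'Tokenizer.ozf'   % placeholder URL, adjust to the installed module
   System
   Application
define
   T = {New Tokenizer.'class' init()}
   Tokens = {T tokenize("Oz is a multiparadigm language." $)}
   % print one token per line
   {ForAll Tokens System.showInfo}
   {Application.exit 0}
end

Compiled with ozc -x and run, this should print one token per line, using the default behaviour described above.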
These are the overridable methods that control how the tokenizer works:
isWordChar(C ?B)
    B is bound to true if C is to be handled as part of a word.
isPunctuationChar(C ?B)
    B is bound to true if C is to be handled as a punctuation character.
toToken(Cs ?Token)
    Token is bound to the token built from the character list Cs; override this to produce e.g. atoms instead of strings.
postProcess(TokensIn ?TokensOut)
    TokensOut is bound to a post-processed version of the token list TokensIn; this is where a subclass can, for example, split contractions.
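To adapt the tokenizer, subclass it and override one or more of these methods. The following is only a sketch, not part of the package: a hypothetical subclass that also accepts hyphens inside words, so that e.g. "mother-in-law" stays a single token, and defers to the default definition for every other character. The import URL is again a placeholder.

functor
import
   Tokenizer at 'Tokenizer.ozf'   % placeholder URL, adjust to the installed module
export
   'class' : HyphenTokenizer
define
   Super = Tokenizer.'class'
   class HyphenTokenizer from Super
      meth isWordChar(C ?B)
         % treat '-' as part of a word, otherwise ask the parent class
         if C == &- then B = true
         else Super, isWordChar(C B)
         end
      end
   end
end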
The package also contains a tokenizer for English that is implemented by subclassing Tokenizer.
The distribution also includes a stand-alone application which prints each token on a separate line. It can be invoked on a text file in the following way:
tokenize --in=test.txt
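For example, with a file test.txt containing just the words "a small test", the output should look along the lines of:

a
small
test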