Gump Tokenizer

A Gump-based natural language tokenizer
Torbjörn Lager
provides: [nlp] x-ozlib://lager/gump-tokenizer/EnglishTokenizer.ozf
[nlp] x-ozlib://lager/gump-tokenizer/tokenize.exe

This is a natural language tokenizer based on Gump (which is itself based on Flex). It improves on the simple tokenizer and represents a much better approach to tokenizing natural language. Among other things, it does not need a separate sentence splitter, since it handles sentence splitting itself. It is, however, somewhat slower and more heavyweight than the simple tokenizer.

Although the tokenizer in this package is set up for English, it should be fairly straightforward to port to other (similar) languages. Note, however, that a tokenizer for natural language often needs to be tuned, not only to a particular language, but also to the kind of text it will be used on.
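
To illustrate the general idea of a rule-based tokenizer that also splits sentences, here is a minimal sketch in Python. It is not the code of this package, which is written in Oz on top of Gump, and it does not reflect the package's interface. The names `ABBREVIATIONS` and `make_tokenizer` and the rule set are assumptions made for the example. Rules are tried in order at each position, loosely as in a Flex-style scanner, and the abbreviation list stands in for the language- and text-specific tuning mentioned above.

```python
import re

# Hypothetical abbreviation list; a real tokenizer would tune this
# to the language and the kind of text being processed.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}


def make_tokenizer(abbreviations):
    """Return a function mapping text to a list of sentences (lists of tokens)."""
    # Longest abbreviations first; "(?!)" never matches, for an empty list.
    abbrev = "|".join(
        re.escape(a) for a in sorted(abbreviations, key=len, reverse=True)
    ) or "(?!)"
    # Alternatives are tried left to right, so earlier rules take priority.
    pattern = re.compile(
        rf"(?P<abbrev>\b(?:{abbrev}))"       # known abbreviations keep their period
        r"|(?P<number>\d+(?:[.,]\d+)*)"      # numbers such as 3.50 or 1,000
        r"|(?P<word>\w+(?:[-']\w+)*)"        # words, including didn't and well-known
        r"|(?P<punct>[^\w\s])",              # any other single non-space character
        re.IGNORECASE,
    )

    def tokenize(text):
        sentences, current = [], []
        for match in pattern.finditer(text):
            current.append(match.group())
            # A free-standing ., ! or ? closes the current sentence.
            if match.lastgroup == "punct" and match.group() in ".!?":
                sentences.append(current)
                current = []
        if current:
            sentences.append(current)
        return sentences

    return tokenize


tokenize = make_tokenizer(ABBREVIATIONS)
for sentence in tokenize("Dr. Smith paid 3.50 dollars. He didn't complain!"):
    print(sentence)
```

Because the period in "Dr." and the one in "3.50" are consumed by earlier rules, only the remaining periods are treated as sentence boundaries. Porting such a sketch to another language would mostly mean replacing the abbreviation list and adjusting the word and number rules.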