This is a natural language tokenizer based on Gump
(which is itself based on Flex). It improves upon the simple
tokenizer, and represents a much better approach to tokenization of natural
language. Among other things, it does not need to be paired with a sentence splitter,
since it handles sentence splitting itself. It is, however, somewhat slower
and more heavyweight than the simple tokenizer.
Although the tokenizer in this package is set up for English, it should be
fairly straightforward to port to other (similar) languages. Note, however, that
a tokenizer for natural language often needs to be tuned not only to a particular
language but also to the kind of text on which it is going to be used.
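
To illustrate what it means for a tokenizer to handle sentence splitting itself,
and where language-specific tuning comes in, here is a minimal sketch in Python.
It is not the code of this package, which is written as Gump scanner rules; the
names tokenize, TOKEN_RE, ABBREVIATIONS and SENTENCE_END, and the contents of the
abbreviation list, are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation list. This is the part that would need tuning
# for a particular language and a particular kind of text.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc."}

# Tokens that close a sentence when they stand alone.
SENTENCE_END = {".", "!", "?"}

# Alternatives are tried left to right, roughly like rules in a lexer:
#   1. dotted abbreviations such as "e.g." or "U.S."
#   2. words, optionally with internal apostrophes or hyphens
#   3. any single punctuation character
TOKEN_RE = re.compile(r"(?:[A-Za-z]\.){2,}|\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Return a list of sentences, each a list of token strings."""
    sentences, current = [], []
    prev_end = -1  # end offset of the previous match
    for m in TOKEN_RE.finditer(text):
        tok = m.group()
        # A period directly attached to a known abbreviation is merged
        # into that token and does not end the sentence.
        if (tok == "." and current and m.start() == prev_end
                and (current[-1] + ".").lower() in ABBREVIATIONS):
            current[-1] += "."
        else:
            current.append(tok)
            if tok in SENTENCE_END:
                sentences.append(current)
                current = []
        prev_end = m.end()
    if current:  # trailing tokens without closing punctuation
        sentences.append(current)
    return sentences
```

Calling tokenize("Dr. Smith arrived. He didn't stay long!") yields two sentences,
with "Dr." kept as a single token. A known limitation of this sketch is that a
sentence ending in a listed abbreviation (for example "etc.") is not split, which
is one reason such rules usually need tuning to the texts at hand.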