mogul:/lager/gump-tokenizer

type	:	package
id	:	`mogul:/lager/gump-tokenizer`
section	:	mogul:/lager
blurb	:	A Gump-based natural language tokenizer
author	:	Torbjörn Lager
category	:	`nlp`
documentation	:	index.html
download	:	lager-gump-tokenizer__1.2.5__source__0.pkg lager-gump-tokenizer__1.3.0__source__0.pkg
provides	:	`[nlp] x-ozlib://lager/gump-tokenizer/EnglishTokenizer.ozf` `[nlp] x-ozlib://lager/gump-tokenizer/tokenize.exe`

This is a natural language tokenizer based on Gump, which is itself based on Flex. It improves on the simple tokenizer and represents a much better approach to tokenizing natural language. Among other things, it does not need a separate sentence splitter, since it handles sentence splitting itself. It is, however, somewhat slower and more heavyweight than the simple tokenizer.

Although the tokenizer in this package is set up for English, it should be fairly straightforward to port to other (similar) languages. Note, however, that a natural language tokenizer often needs to be tuned not only to a particular language, but also to the kind of text it will be applied to.
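
To illustrate the two points above (tokenizing and sentence splitting in a single pass, and the need for language- and genre-specific tuning), here is a minimal sketch in Python. It is not the implementation in this package, which is generated from a Gump specification; the names `tokenize`, `ABBREVIATIONS`, `SENTENCE_END` and `WORD_RE` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical abbreviation list: the kind of language- and genre-specific
# data that has to be tuned for each language and text type.
ABBREVIATIONS = frozenset({"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e.", "etc."})

# Tokens that close a sentence in this sketch.
SENTENCE_END = frozenset({".", "!", "?"})

# Alternatives are tried left to right:
#   numbers such as 3.50 or 1,000
#   words, optionally with internal apostrophes or hyphens
#   any other single non-space character (punctuation)
WORD_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text, abbreviations=ABBREVIATIONS):
    """Return a list of sentences, each a list of token strings."""
    sentences, current = [], []
    for chunk in text.split():
        # A known abbreviation keeps its period and never ends a sentence.
        if chunk.lower() in abbreviations:
            current.append(chunk)
            continue
        for token in WORD_RE.findall(chunk):
            current.append(token)
            if token in SENTENCE_END:
                sentences.append(current)
                current = []
    if current:  # text that does not end with sentence punctuation
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in tokenize("Dr. Smith paid 3.50 dollars. Isn't that cheap?"):
        print(sentence)
```

Because the sentence boundary decision is made while tokens are produced, no separate sentence splitter is needed. Swapping in a different abbreviation set changes the result: with the default set, "See p. 4." is split into two sentences, while passing `abbreviations={"p."}` keeps it as one.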

Gump Tokenizer