Gump Tokenizer

type:package
id:mogul:/lager/gump-tokenizer
section:mogul:/lager
blurb:A Gump-based natural language tokenizer
author:Torbjörn Lager
category:nlp
documentation:index.html
download:lager-gump-tokenizer__1.2.5__source__0.pkg
lager-gump-tokenizer__1.3.0__source__0.pkg
provides:[nlp] x-ozlib://lager/gump-tokenizer/EnglishTokenizer.ozf
[nlp] x-ozlib://lager/gump-tokenizer/tokenize.exe

This is a natural language tokenizer based on Gump (which is itself based on Flex). It improves on the simple tokenizer and is a much better approach to tokenizing natural language. Among other things, it does not need a separate sentence splitter, since it handles sentence splitting itself. It is, however, somewhat slower and more heavyweight than the simple tokenizer.
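To illustrate what it means for a tokenizer to handle sentence splitting by itself, here is a minimal sketch in Python. It is not the code in this package, which is written for Gump and Oz; the names `ABBREVIATIONS`, `SENTENCE_END`, `TOKEN_RE` and `tokenize`, as well as the abbreviation list, are assumptions made up for this example. The pattern lists its alternatives in priority order, roughly in the spirit of a Flex-style scanner specification, and the sentence boundary is decided in the same pass that produces the tokens.

```python
import re

# Hypothetical abbreviation list; a real tokenizer needs one tuned to its texts.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc."}

# Tokens that close a sentence in this simplified sketch.
SENTENCE_END = {".", "!", "?"}

# Alternatives are tried in order at each position, so earlier ones take priority.
TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}             # dotted abbreviations such as "e.g." or "U.S."
  | \d+(?:[.,]\d+)*                # numbers, including "3.50" and "1,000"
  | [A-Za-z]+(?:'[A-Za-z]+)?\.?    # words, optionally with a trailing period
  | \S                             # any other non-space character
""", re.VERBOSE)


def tokenize(text):
    """Return a list of sentences, each a list of token strings."""
    sentences, current = [], []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        is_word_plus_period = (
            len(tok) > 1 and tok.endswith(".") and "." not in tok[:-1]
        )
        if is_word_plus_period and tok.lower() not in ABBREVIATIONS:
            # An ordinary word followed by a period: emit the word,
            # then treat the period as a token of its own.
            current.append(tok[:-1])
            tok = "."
        current.append(tok)
        if tok in SENTENCE_END:
            # The sentence boundary is decided during tokenization itself.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in tokenize("Dr. Smith paid 3.50 dollars. It wasn't cheap!"):
        print(sentence)
```

The sketch ignores many cases a real tokenizer must deal with, such as a dotted abbreviation that also ends a sentence, quotation marks after the final period, and non-ASCII letters.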

Although the tokenizer in this package is set up for English, it should be fairly straightforward to port to other (similar) languages. Note, however, that a tokenizer for natural language often needs to be tuned, not only to a particular language, but also to the kind of text it will be used on.