type          : package
id            : mogul:/lager/text-categorizer
section       : mogul:/lager
blurb         : An N-gram-based text categorizer/language recognizer
author        : Torbjörn Lager
category      : nlp
documentation : index.html
download      : lager-text-categorizer__1.2.5__source__0.pkg
                lager-text-categorizer__1.3.0__source__0.pkg
provides      : [nlp] x-ozlib://lager/text-categorizer/TextCategorizer.ozf
                [nlp] x-ozlib://lager/text-categorizer/TextCategorizerManager.ozf
                [nlp] x-ozlib://lager/text-categorizer/categorize.exe
                [nlp] x-ozlib://lager/text-categorizer/train.exe
This is an implementation in pure Oz of the text categorization method described in:
Cavnar, W. B. and J. M. Trenkle, N-Gram-Based Text Categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
Gertjan van Noord's Perl implementation, available from http://odur.let.rug.nl/~vannoord/TextCat/, provided a lot of inspiration as well. Like van Noord's distribution, this Oz implementation concentrates on the task of recognizing languages. The method itself is more general, however: Cavnar and Trenkle also use it to categorize documents by their content, and there is no reason why that would not work with this implementation too.
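For readers unfamiliar with the method, the following is a minimal Python sketch of the idea in Cavnar and Trenkle's paper: build a frequency-ranked profile of character n-grams for each category, then assign a text to the category whose profile is closest under the "out-of-place" rank distance. It is not the Oz code from this package. The function names, the `_` padding character, the whitespace tokenization, and the cut-off of 300 n-grams are illustrative assumptions.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=300):
    """Return the most frequent character n-grams of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # assumed padding character marking word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams absent from the category get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))


def categorize(text, models):
    """Return the name of the category whose profile is nearest to the text's profile."""
    doc = ngram_profile(text)
    return min(models, key=lambda name: out_of_place(doc, models[name]))
```

A possible use, with two tiny made-up training texts standing in for real language corpora:

```python
models = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "sv": ngram_profile("det var en gång en liten flicka som bodde i skogen"),
}
print(categorize("the dog and the fox", models))
```

Real models need far more training text than this; with samples this small the ranking is dominated by noise.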
Two modules are of particular interest to a potential user. The TextCategorizer module exports a class whose public methods perform the categorization itself, assuming that a set of categories and their corresponding models already exists (in the form of a pickled record). The TextCategorizerManager inherits from TextCategorizer and adds public methods for creating new models from texts whose category is known.
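The split between a categorizer that only reads existing models and a manager that also creates them can be illustrated with a small Python analogue. This is a sketch of the design only, not the package's Oz API. The class names `Categorizer` and `CategorizerManager`, the methods `load`, `categorize`, `train` and `save`, and the simplified trigram-only `profile` helper are all hypothetical, and Python's `pickle` stands in loosely for Oz pickles.

```python
import pickle
from collections import Counter


def profile(text, n=3, top=300):
    """Rank the character n-grams of text by frequency (simplified to one length)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


class Categorizer:
    """Read-only side: classifies text against models that already exist."""

    def __init__(self, models):
        self.models = models  # dict: category name -> ranked n-gram list

    @classmethod
    def load(cls, path):
        # Hypothetical loader: reads a pickled dict of models from disk.
        with open(path, "rb") as f:
            return cls(pickle.load(f))

    def categorize(self, text):
        doc = profile(text)

        def distance(name):
            rank = {g: i for i, g in enumerate(self.models[name])}
            # Out-of-place distance; unseen n-grams get a fixed penalty.
            return sum(abs(i - rank[g]) if g in rank else len(rank)
                       for i, g in enumerate(doc))

        return min(self.models, key=distance)


class CategorizerManager(Categorizer):
    """Adds model creation on top of the inherited categorization methods."""

    def __init__(self, models=None):
        super().__init__(models if models is not None else {})

    def train(self, name, text):
        # Build (or replace) the model for one category from a labelled text.
        self.models[name] = profile(text)

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.models, f)
```

In this arrangement an application that only needs recognition depends on the smaller class and a models file, while training tools use the manager to produce that file.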