mogul:/lager/text-categorizer

type	:	package
id	:	`mogul:/lager/text-categorizer`
section	:	mogul:/lager
blurb	:	An N-gram-based text categorizer/language recognizer
author	:	Torbjörn Lager
category	:	`nlp`
documentation	:	index.html
download	:	lager-text-categorizer__1.2.5__source__0.pkg lager-text-categorizer__1.3.0__source__0.pkg
provides	:	`[nlp] x-ozlib://lager/text-categorizer/TextCategorizer.ozf` `[nlp] x-ozlib://lager/text-categorizer/TextCategorizerManager.ozf` `[nlp] x-ozlib://lager/text-categorizer/categorize.exe` `[nlp] x-ozlib://lager/text-categorizer/train.exe`

This is a pure Oz implementation of the text categorization method described in:

Cavnar, W. B. and J. M. Trenkle, N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

Gertjan van Noord's Perl implementation, available from http://odur.let.rug.nl/~vannoord/TextCat/, also provided a lot of inspiration. Like van Noord's distribution, this Oz implementation concentrates on the task of recognizing languages. The method itself is more general, however: Cavnar and Trenkle also use it to categorize documents by their contents, and there is no reason why that would not work with this implementation too.
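For readers unfamiliar with the method, the following is a minimal sketch of its core idea: build a rank-ordered profile of the most frequent character n-grams for each category, then assign a text to the category whose profile is closest under the "out-of-place" rank distance. It is written in Python purely for illustration and is not this package's Oz code or API. The function names, the whitespace tokenization, the underscore padding, the tie-breaking rule, the fixed penalty for unseen n-grams, and the tiny training texts are all assumptions made for the example.

```python
from collections import Counter

TOP_K = 300  # profile length; assumed here, following the order of magnitude used in the paper


def ngram_profile(text, max_n=5, top_k=TOP_K):
    """Return the top_k most frequent character n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = "_" + token + "_"  # mark word boundaries (simplified padding)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, cat_profile, max_penalty=TOP_K):
    """Sum of rank differences; n-grams missing from the category profile cost max_penalty."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def categorize(text, models):
    """Return the name of the category whose profile is nearest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda name: out_of_place(doc_profile, models[name]))


# Toy models built from one sentence each; real models need far more training text.
models = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "swedish": ngram_profile("den snabba bruna räven hoppar över den lata hunden och katten"),
}

print(categorize("the dog and the fox", models))
print(categorize("den lata katten", models))
```

Nothing in the sketch is specific to languages: the categories could equally be topics, with one profile built from sample documents of each topic.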

Two modules are of particular interest to a potential user. The TextCategorizer module exports a class whose public methods perform the text categorization itself, assuming that a set of categories and their corresponding models already exists (in the form of a pickled record). The TextCategorizerManager inherits from TextCategorizer and adds a number of public methods for creating new models from known texts.

Text Categorizer