This is an implementation in pure Oz of the text categorization method described in
Cavnar, W. B. and J. M. Trenkle, N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
Gertjan van Noord's Perl implementation, available from http://odur.let.rug.nl/~vannoord/TextCat/, provided a lot of inspiration too. Like van Noord's distribution, this Oz implementation concentrates on the task of recognizing languages. The method itself is more general, however: Cavnar and Trenkle also use it to categorize documents by their contents, and there is no reason why that would not work with this implementation too.
Two modules are of particular interest to a potential user. The TextCategorizer module exports a class whose public methods perform the categorization itself, assuming that a set of categories and their corresponding models already exists (in the form of a pickled record). The TextCategorizerManager module exports a class that inherits from TextCategorizer and adds public methods for creating new models from known texts.
Download the package, and invoke ozmake in a shell as follows:
ozmake --install --package=lager-text-categorizer.pkg
By default, all files of the package are installed in the user's ~/.oz
directory tree. In particular, all modules are installed in the user's private
cache.
The TextCategorizer class has the following public methods:
init(File)
Loads the model store from File.
rank(String ?Ranking)
Ranking gets bound to a list of pairs of the form ModelName#Distance, where Distance is an integer representing the distance between the model ModelName and the model of String. The list is sorted in order of increasing distance.
categorize(String ?ModelName)
ModelName gets bound to the name of the model closest to the model of String.
models(?ModelNames)
Binds ModelNames to the list of names of stored models.
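As a rough illustration of what rank and categorize compute, the following Python sketch follows the n-gram profile and out-of-place distance idea from Cavnar and Trenkle's paper. It is not the package's Oz code: the function names, the cutoff of 300 n-grams, the underscore padding, and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter
import re


def profile(text, max_n=5, top=300):
    """Return the `top` most frequent character n-grams (n = 1..max_n),
    most frequent first. Cutoff and padding are illustrative assumptions."""
    counts = Counter()
    # Split into runs of letters; digits and punctuation are discarded.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending count; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def distance(doc_profile, model_profile):
    """Out-of-place distance: sum of rank differences, with a fixed
    penalty for n-grams that the model profile does not contain."""
    position = {gram: i for i, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    return sum(
        abs(i - position[gram]) if gram in position else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def rank(models, text):
    """Return (model_name, distance) pairs sorted by increasing distance."""
    doc = profile(text)
    pairs = [(name, distance(doc, prof)) for name, prof in models.items()]
    return sorted(pairs, key=lambda pair: pair[1])


def categorize(models, text):
    """Return the name of the model closest to the model of `text`."""
    return rank(models, text)[0][0]


if __name__ == "__main__":
    # Tiny made-up training texts; real models need far more data.
    samples = {
        "english": "the quick brown fox jumps over the lazy dog and the cat",
        "swedish": "den snabba bruna räven hoppar över den lata hunden och katten",
    }
    models = {name: profile(text) for name, text in samples.items()}
    print(rank(models, "the dog and the fox"))
```

Here `models` plays the role of the model store: a mapping from model names to profiles. Because the distance is a sum of integer rank differences, it is itself an integer, which matches the ModelName#Distance pairs described above.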
The TextCategorizerManager class has the following public methods:
init(File<=new)
Loads the model store from File, or (by default) starts from a new model.
addModel(ModelName String)
Adds a model of String to the model store under the name ModelName.
addModelFromFile(ModelName File)
Adds a model of the contents of File to the model store under the name ModelName.
addModelsFromDir(Dir)
Adds, for each file F in directory Dir, a model of the contents of F to the current model store. To be considered, the name of F must have the form <name>.txt (<name> must not contain any period), and the resulting model is stored under the name <name>.
saveModels(File)
Saves the model store to File.
In addition, all methods inherited from TextCategorizer are available.
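The file-name rule used by addModelsFromDir can be sketched in Python as follows. This is only an illustration, not the Oz implementation: `model_name` and `texts_from_dir` are hypothetical helpers, the store is a plain dictionary holding raw text rather than a real model, and UTF-8 input, a case-sensitive `.txt` match, and replacing an entry that already exists under the same name are assumptions.

```python
from pathlib import Path


def model_name(filename):
    """Return <name> if `filename` has the form <name>.txt with no period
    in <name>; otherwise return None."""
    stem, dot, ext = filename.rpartition(".")
    if dot and ext == "txt" and stem and "." not in stem:
        return stem
    return None


def texts_from_dir(directory, store=None):
    """Add the contents of every qualifying file in `directory` to `store`,
    keyed by <name>. An existing entry with the same name is replaced."""
    store = {} if store is None else store
    for path in sorted(Path(directory).iterdir()):
        name = model_name(path.name) if path.is_file() else None
        if name is not None:
            # Assumption: files are UTF-8. A real model store would hold
            # a model built from this text, not the text itself.
            store[name] = path.read_text(encoding="utf-8")
    return store
```

With this rule, `english.txt` is stored under `english`, while `en.utf8.txt` and `notes.md` are skipped.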
The distribution includes two example applications: categorize and train. These applications use the TextCategorizer module and the TextCategorizerManager module, respectively. For example, categorize may be invoked as follows:
categorize -l "This is an example of English"
and will then load the default model store and print, on standard output:
Closest match: english
To find out which models (in this case, languages) the current model store supports, run:
categorize -c
which will print, on standard output, the list:
Available models: danish dutch english estonian finnish french german hungarian icelandic italian norwegian polish portuguese spanish swedish turkish
You may use train to create new models. For example, the invocation
train --directory=shortTexts --out=mymodels.ozp
will create new models for the text files in the directory shortTexts and add them to mymodels.ozp. (By the way, I have borrowed these language samples from van Noord's distribution.) The program considers each file whose name has the form <name>.txt (no periods are allowed in <name>) and names the corresponding model <name>. If mymodels.ozp already contains a model with that name, it is replaced.
--in and --out may point to the same file. If --in is not specified, a new model will be created. If --out is not specified, the store will be saved in default.ozp.