Simple Sentence Splitter

provides: x-ozlib://lager/sentence-splitter/SentenceSplitter.ozf

Purpose

The SentenceSplitter module exports a simple sentence splitter for English. Given a string, assumed to be English text, it returns a list of strings, where each element is an English sentence. By default, it treats occurrences of '.', '?' and '!' as sentence delimiters, but does its best to determine when an occurrence of '.' does not have this role (for example in abbreviations, URLs and numbers). Although the splitter is designed to work for English, it should be straightforward to adapt to other (similar) languages.

Note that this is a simple program that does not always do the right thing. It was written for use by the Brill tagger, but since it is also useful on its own, I decided to make it available separately. Hopefully, it will eventually be replaced with something better.

Installation

Download the package, and invoke ozmake in a shell as follows:

ozmake --install --package=lager-sentence-splitter.pkg

By default, all files of the package are installed in the user's ~/.oz directory tree. In particular, all modules are installed in the user's private cache.

Usage

import SentenceSplitter at 'x-ozlib://lager/sentence-splitter/SentenceSplitter.ozf'      
... 
{SentenceSplitter.split +S ?Ss}
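
As a minimal sketch of how the import and the call fit together in a complete program, the functor below splits a short string and prints each resulting sentence on its own line. The input text is purely illustrative; the only SentenceSplitter feature used is split, as documented above, and the printing relies on the standard System module.

```oz
% Illustrative sketch: the input string is made up for this example.
functor
import
   Application
   System
   SentenceSplitter at 'x-ozlib://lager/sentence-splitter/SentenceSplitter.ozf'
define
   % Split an English text into a list of sentence strings.
   Ss = {SentenceSplitter.split "Did he mind? Adam Jones Jr. thinks he didn't."}
   % Print each sentence on its own line.
   for S in Ss do
      {System.showInfo S}
   end
   {Application.exit 0}
end
```

Assuming the functor is saved as demo.oz (a hypothetical file name), it can be compiled with ozc -c demo.oz and run with ozengine demo.ozf once the package has been installed.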

Example

For example,

{SentenceSplitter.split "THE BIG RIPOFF

Mr. John B. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid far too much for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. "}

yields

["THE BIG RIPOFF" "Mr. John B. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid far too much for it." "Did he mind?" "Adam Jones Jr. thinks he didn't." "In any case, this isn't true..." "Well, with a probability of .9 it isn't."]

as output.

Example Application

The distribution also includes a stand-alone application which prints each sentence on a separate line. It can be invoked in the following way on a text file:

split --in=test.txt
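
For readers who want to build a similar tool, here is a rough sketch of how such an application could be written on top of the module. It is not the source of the distributed split application, which may well work differently. It assumes only the --in option shown above and uses the standard Application and Open modules to parse the command line and read the file.

```oz
% Hypothetical sketch; NOT the source of the distributed split application.
functor
import
   Application
   Open
   System
   SentenceSplitter at 'x-ozlib://lager/sentence-splitter/SentenceSplitter.ozf'
define
   % Parse the command line; only the --in option is handled here.
   Args = {Application.getArgs record('in'(single type:string))}
   % Read the whole input file into a string.
   File = {New Open.file init(name:Args.'in' flags:[read])}
   Text = {File read(list:$ size:all)}
   {File close}
   % Print one sentence per line.
   for S in {SentenceSplitter.split Text} do
      {System.showInfo S}
   end
   {Application.exit 0}
end
```

The sketch does no error handling: if --in is omitted or the file cannot be opened, it simply fails with an exception.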

Torbjörn Lager