From Drew Farris <>
Subject Re: Collocation clarification
Date Fri, 15 Jan 2010 22:49:57 GMT
On Fri, Jan 15, 2010 at 10:00 AM, Grant Ingersoll <> wrote:
> Yeah, I think it makes sense to have a SentenceTokenFilter (as well as
> ParagraphTokenFilter). In fact, this would be a welcome contribution to
> Lucene as a new package under the Analyzers in a "o.a.l.analysis.boundary"
> package (to include other boundary detection techniques, such as
> paragraph, etc.). Define a common set of constants that indicate the
> boundary and then we can have different implementations. If you really
> wanted to go nuts, you could create a SpanBoundaryQuery class that took
> in other clauses along w/ the boundary token and did a SpanNearQuery
> within boundaries. Of course, I don't want to distract you from
> contributing to Mahout, so...

Ok, thanks for the pointer and roughing out an approach. I'll look
into a SentenceTokenFilter and see where that takes me.

>> Any idea what sort of edge cases I need to look for when using BreakIterator?
> Buy the book :-)... Just kidding, it doesn't handle abbreviations very
> well, is the first thing that jumps to mind. I seem to recall needing
> less than 10 or so rules to do a pretty decent job. Never did formal
> testing on it, though.

Ok, OK :-)

I've found that abbreviations, various identifiers, etc. are the typical
cases where these things fall flat. I'll see how it performs vs. writing
something from scratch and see what I can come up with.
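
For illustration only, here is a minimal sketch of the kind of rule layer
discussed above: split with java.text.BreakIterator, then re-join any
fragment that ends in a known abbreviation. The class name, the method
names, and the abbreviation list are all hypothetical and are not taken
from Lucene or Mahout; a real rule set would need more cases (identifiers,
initials, an abbreviation that really does end a sentence).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: BreakIterator plus one hand-written rule.
class SentenceSplitSketch {

    // Assumed, deliberately tiny abbreviation list (lowercase, with trailing period).
    private static final Set<String> ABBREVIATIONS = new HashSet<>(
            Arrays.asList("dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc.", "vs."));

    // Split text at the boundaries reported by the JDK sentence BreakIterator.
    static List<String> rawSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // Rule: a fragment whose last token is a known abbreviation is not a
    // complete sentence, so it is joined with the fragment that follows.
    static List<String> mergeAfterAbbreviations(List<String> raw) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String s : raw) {
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(s);
            if (!endsWithAbbreviation(s)) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            merged.add(pending.toString()); // trailing fragment with nothing to join
        }
        return merged;
    }

    static boolean endsWithAbbreviation(String s) {
        String last = s.substring(s.lastIndexOf(' ') + 1).toLowerCase(Locale.ROOT);
        return ABBREVIATIONS.contains(last);
    }

    public static void main(String[] args) {
        String text = "Dr. Smith arrived. He sat down.";
        System.out.println(rawSentences(text));
        System.out.println(mergeAfterAbbreviations(rawSentences(text)));
    }
}
```

Note that this single rule over-merges when an abbreviation such as "etc."
genuinely ends a sentence, which is one reason a handful of additional
rules (or a statistical approach) would be needed in practice.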

> Right, although just slightly ironic that we are using a rule-based
> system for a machine learning project.

Heh, indeed, but it seems entirely appropriate in this case. Of
course, now I need to go read about statistical approaches to sentence
boundary detection.


