lucene-java-user mailing list archives

From Bob Carpenter <>
Subject Re: does anyone know of a 'smart' categorizing text pattern finder?
Date Tue, 21 Nov 2006 22:46:01 GMT
Vladimir Olenin wrote:
> Hi,
> I wonder if anyone here knows if there is a 'smart' text pattern finder, ideally written
> in Java. The library I'm looking for should be able to 'guess' the category of the particular
> text on the page, most probably by finding similarities between the bulk of the pages and
> a set of templates.

This is another problem you can actually do pretty
well in Lucene itself.  Either index with your usual
analyzer or use the n-gram analyzers we wrote about
in Lucene in Action.
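
For anyone unsure what n-gram analysis means here, the following is
a minimal sketch of character n-gram extraction in plain Java. It is
not the analyzer from the book or from Lucene; the class name
CharNGrams and the method ngrams are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits text into overlapping character n-grams. */
class CharNGrams {
    /** Returns every substring of length n, sliding one character at a time. */
    static List<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> out = new ArrayList<>();
        // Text shorter than n yields an empty list.
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3)); // prints [luc, uce, cen, ene]
    }
}
```

Indexing n-grams instead of whole words makes matching more tolerant
of spelling variation, at the cost of a larger index.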

Then create an index with a single pseudo-document
per topic, containing all the text you want to use
to train the topic.

Then run the document to classify as a query against
the index, and the highest scoring pseudo-document
is the most likely category according to Lucene's token-based
relevance score.
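
The pseudo-document idea can be sketched without Lucene at all. The
toy below is an assumption-laden stand-in, not Lucene code: it keeps
one term-frequency map per topic and ranks topics with a simple
TF-IDF cosine score in place of Lucene's own scoring. The class
PseudoDocClassifier, its methods, and the sample topics are all
hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy stand-in for the Lucene setup: one pseudo-document per topic. */
class PseudoDocClassifier {
    // topic name -> term frequencies of that topic's pseudo-document
    private final Map<String, Map<String, Integer>> topics = new LinkedHashMap<>();

    /** Crude tokenizer (lowercase, split on non-letters); Lucene would use an analyzer. */
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!tok.isEmpty()) {
                tf.merge(tok, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Appends training text to the topic's pseudo-document. */
    void train(String topic, String text) {
        Map<String, Integer> tf = topics.computeIfAbsent(topic, k -> new HashMap<>());
        termFreqs(text).forEach((term, n) -> tf.merge(term, n, Integer::sum));
    }

    /** Smoothed inverse document frequency over the pseudo-documents. */
    private double idf(String term) {
        long df = topics.values().stream().filter(d -> d.containsKey(term)).count();
        return 1.0 + Math.log((double) topics.size() / (df + 1.0));
    }

    /** Length of the TF-IDF vector built from the given term frequencies. */
    private double norm(Map<String, Integer> tf) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double w = e.getValue() * idf(e.getKey());
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity between the text and the topic's pseudo-document. */
    double score(String topic, String text) {
        Map<String, Integer> doc = topics.getOrDefault(topic, Collections.emptyMap());
        Map<String, Integer> query = termFreqs(text);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : query.entrySet()) {
            double w = idf(e.getKey());
            dot += (e.getValue() * w) * (doc.getOrDefault(e.getKey(), 0) * w);
        }
        double denom = norm(query) * norm(doc);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    /** Returns the best-scoring topic; ties go to the topic trained first. */
    String classify(String text) {
        String best = null;
        double bestScore = -1.0;
        for (String topic : topics.keySet()) {
            double s = score(topic, text);
            if (s > bestScore) {
                bestScore = s;
                best = topic;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PseudoDocClassifier c = new PseudoDocClassifier();
        c.train("sports", "football match goal team score player");
        c.train("cooking", "recipe oven bake flour sugar butter");
        System.out.println(c.classify("the team scored a late goal"));
    }
}
```

With Lucene the same shape applies: each topic's training text goes
into one indexed document, and the text to classify becomes the query.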

You could also check out our more probabilistic
classifiers.  For instance, we have a classification
by topic demo:

And just about every other natural language platform
and most machine learning platforms do classification
(e.g. Mallet and MinorThird, both in Java).  For
general structured classification problems, you
might want to check out Weka.

- Bob Carpenter
