lucene-java-user mailing list archives

From Lars Buitinck <>
Subject [ANNOUNCEMENT] NLP-based Analyzer library for Lucene
Date Tue, 08 Feb 2011 16:50:56 GMT
Dear all,

For anyone wanting to add some NLP abilities to Lucene, I've released
a small library at . This library
performs part-of-speech tagging (determining word categories such as
noun or verb), filtering based on part of speech, and lemmatization
(reducing words to their base form).

In other words: this is an NLP-based replacement for a stemmer and a
stop list, implemented as a Lucene analyzer. It requires the Stanford
POS Tagger.

lucene-stanford-lemmatizer can be used to index or query lemmas as
well as the terms as they appear in text, and/or to filter out terms
before indexing/querying based on their part-of-speech. By default, it
filters out pronouns, determiners (the, a) and several other
non-informative word categories.
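
As a rough illustration of the filtering idea (this is not the
library's actual code), the sketch below drops tokens by tag from text
that has already been tagged. It assumes tagger output of the form
word_TAG with Penn Treebank tags. The class name PosFilterSketch and
the STOP_TAGS set are hypothetical, and the set only approximates the
"pronouns and determiners" part of the default described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosFilterSketch {
    // Assumed stop tags (Penn Treebank): determiners and pronouns.
    // The library's real default list is not spelled out in this message.
    static final Set<String> STOP_TAGS = new HashSet<>(
            Arrays.asList("DT", "PRP", "PRP$", "WDT", "WP", "WP$"));

    // Takes whitespace-separated "word_TAG" tokens and returns the words
    // whose tag is not in STOP_TAGS, in their original order.
    static List<String> filter(String tagged) {
        List<String> kept = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the last underscore so words containing '_' survive.
            int sep = token.lastIndexOf('_');
            if (sep <= 0) {
                continue; // no tag (or empty word): skip the token
            }
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (!STOP_TAGS.contains(tag)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [cat, chased, tail]
        System.out.println(filter("The_DT cat_NN chased_VBD its_PRP$ tail_NN"));
    }
}
```

In a real analyzer the same test would sit inside a token filter and
run on each token as it streams past, rather than on a whole string.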

I've seen this code improve search quality, even on very noisy data.
The software is designed for English, but does a pretty good job at
detecting non-English words and leaving those alone (in contrast to
the Porter/Snowball stemmer).

Lars Buitinck
