lucene-java-user mailing list archives

From Scott Smith <ssm...@mainstreamdata.com>
Subject RE: [ANNOUNCEMENT] NLP-based Analyzer library for Lucene
Date Mon, 14 Feb 2011 18:36:05 GMT
One thing to note is that the Stanford POS Tagger is licensed under the GPL v2.  A commercial license
is available, but it doesn't appear to be free ($3k minimum, if I read correctly).

I wonder what it would take to make this available using OpenNLP, which has a friendlier license.

-----Original Message-----
From: Lars Buitinck [mailto:larsmans@gmail.com] 
Sent: Tuesday, February 08, 2011 9:51 AM
To: Apache Lucene users
Subject: [ANNOUNCEMENT] NLP-based Analyzer library for Lucene

Dear all,

For anyone wanting to add some NLP abilities to Lucene, I've released
a small library at
https://github.com/larsmans/lucene-stanford-lemmatizer . This library
performs part-of-speech tagging (determining word categories such as
noun or verb), filtering based on part of speech, and lemmatizing
(reducing words to their base form).

In other words: this is an NLP-based replacement for a stemmer and a
stop list, implemented as a Lucene analyzer. It requires the Stanford
POS Tagger.

lucene-stanford-lemmatizer can be used to index or query lemmas as
well as the terms as they appear in text, and/or to filter out terms
before indexing/querying based on their part-of-speech. By default, it
filters out pronouns, determiners (the, a) and several other
non-informative word categories.
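
To make the filtering step concrete, here is a minimal sketch in plain Java. It is not the library's actual code or API: the class, the method names and the stop-tag set are hypothetical, and it assumes the tokens have already been tagged (with Penn Treebank tags such as DT and PRP) and lemmatized by an external tagger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosFilterSketch {
    /** A token already tagged and lemmatized by some external tagger (assumed input). */
    static final class TaggedToken {
        final String term;   // surface form as it appears in the text
        final String lemma;  // base form
        final String tag;    // Penn Treebank part-of-speech tag

        TaggedToken(String term, String lemma, String tag) {
            this.term = term;
            this.lemma = lemma;
            this.tag = tag;
        }
    }

    // Hypothetical stop set: personal pronouns, possessive pronouns, determiners.
    // The library's real default set is not reproduced here.
    static final Set<String> STOP_TAGS =
            new HashSet<>(Arrays.asList("PRP", "PRP$", "DT"));

    /**
     * Drops tokens whose tag is in STOP_TAGS and emits the lemma of every
     * remaining token, optionally followed by the surface form when it differs.
     */
    static List<String> indexTerms(List<TaggedToken> tokens, boolean keepSurfaceForm) {
        List<String> out = new ArrayList<>();
        for (TaggedToken t : tokens) {
            if (STOP_TAGS.contains(t.tag)) {
                continue; // filtered out by part of speech
            }
            out.add(t.lemma);
            if (keepSurfaceForm && !t.term.equals(t.lemma)) {
                out.add(t.term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedToken> tokens = Arrays.asList(
                new TaggedToken("the", "the", "DT"),
                new TaggedToken("dogs", "dog", "NNS"),
                new TaggedToken("barked", "bark", "VBD"));
        System.out.println(indexTerms(tokens, true)); // prints [dog, dogs, bark, barked]
    }
}
```

In a real Lucene analyzer the same decision would be made inside a token filter operating on the token stream; the sketch only shows the keep/drop and lemma/surface-form logic.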

I've seen this code improve search quality, even on very noisy data.
The software is designed for English, but does a pretty good job of
detecting non-English words and leaving those alone (in contrast to
the Porter/Snowball stemmer).

Regards,
Lars Buitinck

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
