lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <>
Subject [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
Date Wed, 06 Jun 2012 08:32:24 GMT


Lance Norskog commented on LUCENE-2899:

Notes for a Wiki page:

OpenNLP Integration

What is the integration? The first integration is a Tokenizer and three Filters. 
* The OpenNLPTokenizer uses the OpenNLP SentenceDetector and Tokenizer tools instead of the
standard Lucene Tokenizers. This requires statistical model files. One quirk of these tools
is that all punctuation is kept.
* The OpenNLPFilter implements Parts-of-Speech tagging, Chunking (finding noun/verb phrases),
and Named Entity Recognition (tagging people, place names, etc.). This filter adds all
tags to the tokens as payload attributes.
* The FilterPayloadsFilter removes tokens by checking the payloads. Given a list of payloads,
it will either keep only tokens with one of those payloads, or remove only matching tokens
and keep the rest. (This filter maintains position increments correctly.)
* The StripPayloadsFilter removes payloads from tokens.
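
The keep/remove behavior of the FilterPayloadsFilter, including the position increment handling, can be sketched in plain Java. This is only a simplified model and not the code from the patch: the TaggedToken class, the filter() method, and the example tag strings are all hypothetical, and a real filter would work on Lucene's attribute-based TokenStream rather than on a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PayloadFilterSketch {

    // Hypothetical stand-in for a Lucene token: term text, one payload tag,
    // and a position increment. Not a class from the patch.
    static final class TaggedToken {
        final String term;
        final String payload;
        final int positionIncrement;

        TaggedToken(String term, String payload, int positionIncrement) {
            this.term = term;
            this.payload = payload;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "/" + payload + "(+" + positionIncrement + ")";
        }
    }

    // keepMatching=true keeps only tokens whose payload is in the set;
    // keepMatching=false removes those tokens and keeps the rest.
    static List<TaggedToken> filter(List<TaggedToken> tokens,
                                    Set<String> payloads,
                                    boolean keepMatching) {
        List<TaggedToken> out = new ArrayList<>();
        int skipped = 0; // increments of tokens dropped since the last emitted token
        for (TaggedToken t : tokens) {
            boolean matches = payloads.contains(t.payload);
            if (matches == keepMatching) {
                // Fold the skipped positions into this token's increment so
                // that its absolute position in the stream is unchanged.
                out.add(new TaggedToken(t.term, t.payload, t.positionIncrement + skipped));
                skipped = 0;
            } else {
                skipped += t.positionIncrement;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedToken> tokens = Arrays.asList(
                new TaggedToken("the", "DT", 1),
                new TaggedToken("quick", "JJ", 1),
                new TaggedToken("fox", "NN", 1),
                new TaggedToken("jumps", "VBZ", 1));
        Set<String> tags = new HashSet<>(Arrays.asList("NN", "VBZ"));
        System.out.println(filter(tokens, tags, true));  // keep only matching tokens
        System.out.println(filter(tokens, tags, false)); // remove matching tokens
    }
}
```

Carrying the skipped increments forward is what keeps phrase and span queries aligned with the original text after tokens are dropped.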

How do I get going?
* pull the latest trunk
* apply the patch
* download these models to contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp/
** []
** Everything that starts with 'en'
* download the OpenNLP distribution from []
** Currently it is apache-opennlp-1.5.2-incubating-bin.tar.gz
* unpack this and copy the jar files from lib/ to

Now go to trunk-dir/solr and run 'ant test-contrib'. This compiles against the OpenNLP
libraries and uses the model files.
Next, run 'ant example', cd to the example directory, and run
'java -Dsolr.solr.home=opennlp -jar start.jar'.
Solr should now start without any exceptions. At this point, go to the Schema analyzer, pick
the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should
get tokenized text with payloads. Unfortunately, the analysis page shows the payloads as
bytes instead of text. If you would like them shown as text, go vote on [SOLR-3493].
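
To illustrate the bytes-versus-text difference on the analysis page, here is a minimal sketch that renders the same payload both ways. It assumes the payload bytes are a UTF-8 encoded tag string, which the notes above do not state; the class and method names are hypothetical, and the (bytes, offset, length) parameters only mimic how a byte slice is usually passed around.

```java
import java.nio.charset.StandardCharsets;

final class PayloadTextSketch {

    // Renders a payload slice as space-separated hex bytes, roughly what a
    // byte-oriented display would show.
    static String toHex(byte[] bytes, int offset, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = offset; i < offset + length; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    // Renders the same slice as text. Assumption: the payload is UTF-8 text.
    static String toText(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = "NNP".getBytes(StandardCharsets.UTF_8); // example tag string
        System.out.println(toHex(payload, 0, payload.length));
        System.out.println(toText(payload, 0, payload.length));
    }
}
```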

> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>                 Key: LUCENE-2899
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: opennlp_trunk.patch
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a
> submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I
> have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have
> to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads
> (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp
