Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Wed, 6 Jun 2012 08:32:24 +0000 (UTC)
From: "Lance Norskog (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <952975152.42948.1338971544183.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <29163131.294471296402283412.JavaMail.jira@thor>
Subject: [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities
 as a module
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290015#comment-13290015 ] 

Lance Norskog commented on LUCENE-2899:
---------------------------------------

Notes for a Wiki page:

OpenNLP Integration

What is the integration? The first integration is a Tokenizer and three Filters. 
* The OpenNLPTokenizer uses the OpenNLP SentenceDetector and Tokenizer tools instead of the standard Lucene Tokenizers. This requires statistical model files. One quirk of these tools is that all punctuation is retained.
* The OpenNLPFilter implements Part-of-Speech tagging, Chunking (finding noun/verb phrases), and Named Entity Recognition (tagging people, place names, etc.). This filter adds all tags as payload attributes to the tokens.
* The FilterPayloadsFilter removes tokens by checking the payloads. Given a list of payloads, it will either keep only tokens with one of those payloads, or remove only matching tokens and keep the rest. (This filter maintains position increments correctly.)
* The StripPayloadsFilter removes payloads from Tokens. 
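To illustrate the keep/remove behaviour and the position-increment bookkeeping described for FilterPayloadsFilter, here is a minimal, self-contained sketch. It is not the code from the patch and does not use the Lucene API: the Tok class, the filter method, and the tag names are hypothetical stand-ins, and the sketch assumes that the increments of dropped tokens are folded into the next surviving token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PayloadFilterSketch {

    /** Hypothetical stand-in for a token: term text, payload tag, position increment. */
    public static final class Tok {
        public final String term;
        public final String payload; // e.g. a part-of-speech tag; null if the token has none
        public final int posInc;

        public Tok(String term, String payload, int posInc) {
            this.term = term;
            this.payload = payload;
            this.posInc = posInc;
        }
    }

    /**
     * keepMatching = true: keep only tokens whose payload is in the given set.
     * keepMatching = false: remove those tokens and keep the rest.
     * The increments of removed tokens are added to the next surviving token,
     * so the surviving tokens stay at their original positions.
     */
    public static List<Tok> filter(List<Tok> in, Set<String> payloads, boolean keepMatching) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (Tok t : in) {
            boolean matches = t.payload != null && payloads.contains(t.payload);
            if (matches == keepMatching) {
                out.add(new Tok(t.term, t.payload, t.posInc + skipped));
                skipped = 0;
            } else {
                skipped += t.posInc;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
                new Tok("the", "DT", 1),
                new Tok("quick", "JJ", 1),
                new Tok("fox", "NN", 1),
                new Tok("jumps", "VBZ", 1));
        Set<String> wanted = new HashSet<>(Arrays.asList("NN", "VBZ"));
        // Keep only nouns and verbs: "fox" carries increment 3, "jumps" carries 1.
        for (Tok t : filter(tokens, wanted, true)) {
            System.out.println(t.term + " /" + t.payload + " +" + t.posInc);
        }
    }
}
```

With keepMatching set to false, the same call would instead return "the" and "quick", each with an increment of 1.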

How do I get going?
* pull the latest trunk
* apply the patch
* download these models to contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp/
** [http://opennlp.sourceforge.net/models-1.5/]
** Everything that starts with 'en'
* download the OpenNLP distribution from [http://opennlp.apache.org/cgi-bin/download.cgi]
** Currently it is apache-opennlp-1.5.2-incubating-bin.tar.gz
* unpack this and copy the jar files from lib/ to solr/contrib/opennlp/lib

Now go to trunk-dir/solr and run 'ant test-contrib'. It compiles against the libraries and uses the model files.
Next, run 'ant example', cd to the example directory, and run 'java -Dsolr.solr.home=opennlp -jar start.jar'.
Solr should now start without any exceptions. At this point, go to the Schema analyzer, pick the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should get the text tokenized, with payloads. Unfortunately, the analysis page shows the payloads as bytes instead of text. If you would like them shown as text, go vote on [SOLR-3493].
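Until the analysis page shows payloads as text, the bytes can be decoded by hand. The sketch below rests on two assumptions that are not confirmed by the patch: that the payload is the tag string encoded as UTF-8, and that you have copied the bytes as space-separated hex values. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class PayloadHexSketch {

    /** Decodes a space-separated hex dump such as "4e 4e 50" into text, assuming UTF-8. */
    public static String decode(String hexDump) {
        String[] parts = hexDump.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // Each part is one byte written as one or two hex digits.
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("4e 4e 50")); // prints "NNP"
    }
}
```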


> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp
