Date: Wed, 6 Jun 2012 08:32:24 +0000 (UTC)
From: "Lance Norskog (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

    [ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290015#comment-13290015 ]

Lance Norskog commented on LUCENE-2899:
---------------------------------------

Notes for a Wiki page:

OpenNLP Integration

What is the integration?

The first integration is a Tokenizer and three Filters.
* The OpenNLPTokenizer uses the OpenNLP SentenceDetector and Tokenizer tools instead of the standard Lucene Tokenizers. This requires statistical model files. One quirk of these tools is that all punctuation is preserved.
* The OpenNLPFilter implements Part-of-Speech tagging, Chunking (finding noun/verb phrases), and Named Entity Recognition (tagging people, place names, etc.). This filter adds all tags as payload attributes to the tokens.
* The FilterPayloadsFilter removes tokens by checking their payloads. Given a list of payloads, it either keeps only the tokens carrying one of those payloads, or removes the matching tokens and keeps the rest. (This filter maintains position increments correctly.)
* The StripPayloadsFilter removes payloads from Tokens.

How do I get going?

* pull the latest trunk
* apply the patch
* download these models to contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp/
** [http://opennlp.sourceforge.net/models-1.5/]
** Everything that starts with 'en'
* download the OpenNLP distribution from [http://opennlp.apache.org/cgi-bin/download.cgi]
** Currently it is apache-opennlp-1.5.2-incubating-bin.tar.gz
* unpack this and copy the jar files from lib/ to solr/contrib/opennlp/lib

Now, go to trunk-dir/solr and run 'ant test-contrib'. This compiles the contrib against the OpenNLP libraries and runs its tests against the model files.

Next, run 'ant example', cd to the example directory, and run 'java -Dsolr.solr.home=opennlp -jar start.jar'. Solr should now start without any Exceptions.

At this point, go to the Schema analyzer, pick the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should get text tokenized with payloads. Unfortunately, the analysis page shows the payloads as bytes instead of text. If you would like this changed, go vote on [SOLR-3493].
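
To make the keep/remove behavior of the FilterPayloadsFilter concrete, here is a minimal stand-alone sketch. It is not the code from the patch and does not use the Lucene or OpenNLP APIs: the class PayloadFilterSketch, the Token type, and the filter() method are hypothetical names invented for illustration. It also assumes that "maintains position increments correctly" means the increments of dropped tokens are added to the next surviving token, which is the usual Lucene convention for filtering filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the filter shipped in the patch.
public class PayloadFilterSketch {

    // Simplified stand-in for a Lucene token: term text, one payload tag,
    // and a position increment relative to the previous token.
    public static final class Token {
        public final String term;
        public final String payload;
        public final int positionIncrement;

        public Token(String term, String payload, int positionIncrement) {
            this.term = term;
            this.payload = payload;
            this.positionIncrement = positionIncrement;
        }
    }

    // keep == true:  emit only tokens whose payload is in the given set.
    // keep == false: drop tokens whose payload is in the given set.
    public static List<Token> filter(List<Token> input, Set<String> payloads, boolean keep) {
        List<Token> out = new ArrayList<>();
        int skipped = 0; // increments of tokens dropped since the last emitted token
        for (Token t : input) {
            boolean matches = payloads.contains(t.payload);
            if (matches == keep) {
                // Fold the dropped tokens' increments into this token (assumed behavior).
                out.add(new Token(t.term, t.payload, t.positionIncrement + skipped));
                skipped = 0;
            } else {
                skipped += t.positionIncrement;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("The", "DT", 1),
            new Token("quick", "JJ", 1),
            new Token("fox", "NN", 1),
            new Token("jumps", "VBZ", 1));
        Set<String> nouns = new HashSet<>(Arrays.asList("NN"));
        for (Token t : filter(tokens, nouns, true)) {
            System.out.println(t.term + " +" + t.positionIncrement);
        }
    }
}
```

With the sample input, keeping only "NN" emits the single token "fox" with a position increment of 3, so phrase and span queries still see the two-token gap in front of it.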

> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp