lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "OpenNLP" by LanceXNorskog
Date Fri, 15 Jun 2012 10:13:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "OpenNLP" page has been changed by LanceXNorskog:
http://wiki.apache.org/solr/OpenNLP

New page:
<!> [[Solr4.0]]
<<TableOfContents(3)>>

<!> This page discusses uncommitted code and design.  See https://issues.apache.org/jira/browse/LUCENE-2899
for the main JIRA issue tracking this development.

== Introduction ==

OpenNLP is a toolkit for Natural Language Processing (NLP). It is an Apache top-level project
located [[http://opennlp.apache.org/|here]]. It includes implementations of many popular NLP
algorithms. This project integrates some of its features into Lucene and Solr. This first
effort incorporates Analyzer chain tools for sentence detection, tokenization, Part-of-Speech
tagging (nouns, verbs, interjections, etc.), Chunking (noun phrases, verb phrases) and Named
Entity Recognition. See the OpenNLP project page for information on the implementations.
Here are some use cases:

=== Indexing interesting words ===
NLP lets you create a field with only the nouns in a document. This would be useful for many
free text applications. The FilterPayloadsFilter and StripPayloadsFilter below are required
for this. See "Full Example" below.

=== Interesting N-Grams ===
Chunking lets you create N-Grams only within noun and verb phrases.
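
To make the idea concrete, here is a minimal Python sketch of N-Grams restricted to noun and verb phrases. It is not the code from the patch. It assumes the chunker has already labelled each token with a B-/I-/O style chunk tag (for example B-NP, I-NP), and the helper names `chunk_phrases` and `phrase_ngrams` are hypothetical. The sample tags are hand-written for illustration.

```python
def chunk_phrases(tagged):
    """Group (token, chunk_tag) pairs into (phrase_type, tokens) phrases.

    Assumes B-XX starts a phrase of type XX, I-XX continues it,
    and any other tag (such as O) ends the current phrase.
    """
    phrases, current, current_type = [], [], None
    for token, tag in tagged:
        if tag.startswith("B-"):
            if current:
                phrases.append((current_type, current))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # Outside any phrase (or a malformed tag): close the open phrase.
            if current:
                phrases.append((current_type, current))
            current, current_type = [], None
    if current:
        phrases.append((current_type, current))
    return phrases


def phrase_ngrams(tagged, n=2, keep=("NP", "VP")):
    """Build N-Grams that never cross a phrase boundary."""
    grams = []
    for phrase_type, tokens in chunk_phrases(tagged):
        if phrase_type not in keep:
            continue
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Hand-tagged sample input (illustrative, not real chunker output).
tagged = [
    ("The", "B-NP"), ("quick", "I-NP"), ("fox", "I-NP"),
    ("has", "B-VP"), ("jumped", "I-VP"),
    ("over", "B-PP"),
    ("the", "B-NP"), ("dog", "I-NP"),
    (".", "O"),
]
bigrams = phrase_ngrams(tagged)
print(bigrams)
```

A plain bigram filter over the same tokens would also emit pairs such as "fox has" that straddle two phrases. Restricting N-Grams to phrases drops those.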

=== Named Entity Recognition ===
Named Entity Recognition identifies names, dates, places, currency and other types of data
within free text. This lets you search on the entities themselves, or create autosuggest entries
with icons for 'Name', 'Place', etc.
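
As a rough illustration of the autosuggest idea, the Python sketch below turns entity spans into typed suggestion entries. It is not part of the patch. It assumes the recognizer reports each entity as a (start, end, type) span over token positions with an exclusive end, and the function name `suggest_entries` and the sample sentence are hypothetical.

```python
def suggest_entries(tokens, spans):
    """Turn entity spans into de-duplicated autosuggest entries.

    tokens: list of token strings
    spans:  list of (start, end, entity_type), end exclusive
    """
    entries = []
    seen = set()
    for start, end, entity_type in spans:
        text = " ".join(tokens[start:end])
        key = (text.lower(), entity_type)
        if key not in seen:  # suggest each entity only once
            seen.add(key)
            entries.append({"text": text, "type": entity_type})
    return entries


# Hand-labelled sample (illustrative, not real recognizer output).
tokens = ["Alice", "Smith", "flew", "to", "Paris", "on", "Monday", "."]
spans = [(0, 2, "person"), (4, 5, "location"), (6, 7, "date")]
entries = suggest_entries(tokens, spans)
for entry in entries:
    print(entry["type"], "->", entry["text"])
```

A user interface could then choose an icon from the "type" value of each entry.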

== Analyzer tools ==

The OpenNLP Tokenizer behavior is similar to the WhitespaceTokenizer but is smart about inter-word
punctuation. The term stream looks very much like the way you parse words and punctuation
while reading. The OpenNLP taggers assign payloads to terms. There are tools to filter the
term stream according to the payload values, and to remove the payloads.
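
The difference from whitespace splitting can be sketched in a few lines of Python. This is only a rule-based stand-in: the real OpenNLP tokenizer uses a statistical model and will differ on harder cases such as abbreviations or contractions. The function names are hypothetical.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.split()


def punctuation_aware_tokens(text):
    """Emit each run of word characters and each punctuation mark
    as its own token (a crude approximation, not OpenNLP itself)."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "Hello, world! Is this (really) a fact?"
print(whitespace_tokens(text))
print(punctuation_aware_tokens(text))
```

The first call keeps terms such as "Hello," and "(really)" intact, while the second separates the punctuation into its own terms.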

=== solr.OpenNLPTokenizerFactory ===

Tokenizes text into sentences or words.

This Tokenizer uses the OpenNLP Sentence Detector and/or Tokenizer classes. When used together,
the Tokenizer receives sentences and can do a better job. The arguments give the file names
of the statistical models:

{{{
    <fieldType name="text_opennlp" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
          sentenceModel="opennlp/en-sent.bin"
          tokenizerModel="opennlp/en-token.bin"
        />
      </analyzer>
    </fieldType>
}}}

=== solr.OpenNLPFilterFactory ===

Tags words using one or more technologies: Part-of-Speech tagging, Chunking, and Named Entity Recognition.


{{{
    <fieldType name="text_opennlp_pos" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
          tokenizerModel="opennlp/en-token.bin"
        />
        <filter class="solr.OpenNLPFilterFactory" 
          posTaggerModel="opennlp/en-pos-maxent.bin"
        />       
      </analyzer>
    </fieldType>
}}}

This example assigns part-of-speech tags based on a model derived with the
[[http://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-maxent/index.html|OpenNLP Maximum Entropy]]
implementation. See
[[http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.postagger.tagging|OpenNLP Tagging]]
for more information. The tags are from the
[[http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|Penn Treebank]]
tagset.

=== solr.FilterPayloadsFilterFactory ===

Filters terms by payload value. This example retains only terms tagged as nouns or verbs
(plus FW, foreign words) in the
[[http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|Penn Treebank]]
tagset.

{{{
        <filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true"
          payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
}}}

=== solr.StripPayloadsFilterFactory ===

Remove payloads from terms.

{{{
        <filter class="solr.StripPayloadsFilterFactory"/>
}}}

== Full Example ==

This "Noun-Verb Filter" field type assigns parts of speech, retains only nouns and verbs,
and removes the payloads. Free-text search sites (for example, newspaper and magazine articles)
may benefit from this.
{{{
    <fieldType name="text_opennlp_nvf" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
          tokenizerModel="opennlp/en-token.bin"
        />
        <filter class="solr.OpenNLPFilterFactory"
          posTaggerModel="opennlp/en-pos-maxent.bin"
        />
        <filter class="solr.FilterPayloadsFilterFactory" payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
        <filter class="solr.StripPayloadsFilterFactory"/>
      </analyzer>
    </fieldType>
}}}

This example should work well with most English-language free text. 
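
The effect of the last two filters in this chain can be modelled in a few lines of Python. This is a conceptual sketch, not the Lucene implementation. It assumes the tagger has already attached a Penn Treebank tag to each term as its payload, and the function names `filter_payloads` and `strip_payloads` and the sample tags are hypothetical.

```python
# Same tag list as the payloadList attribute in the field type above.
NOUN_VERB_TAGS = {
    "NN", "NNS", "NNP", "NNPS",
    "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
    "FW",
}


def filter_payloads(tagged, payload_list):
    """Keep only (term, payload) pairs whose payload is in payload_list."""
    return [(term, payload) for term, payload in tagged if payload in payload_list]


def strip_payloads(tagged):
    """Drop the payloads, leaving bare terms for the index."""
    return [term for term, _ in tagged]


# Hand-tagged sample (illustrative, not real tagger output).
tagged = [
    ("The", "DT"), ("editor", "NN"), ("quickly", "RB"),
    ("reviewed", "VBD"), ("two", "CD"), ("articles", "NNS"), (".", "."),
]
terms = strip_payloads(filter_payloads(tagged, NOUN_VERB_TAGS))
print(terms)
```

Determiners, adverbs, numbers and punctuation are dropped, so only the nouns and verbs reach the index.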

== Installation ==

See the patch for more information. The short story is that you have to download the statistical
models from SourceForge to make OpenNLP work, because the models do not have an Apache-compatible license.
