lucene-java-commits mailing list archives

Subject svn commit: r547226 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/analysis/package.html
Date Thu, 14 Jun 2007 12:09:03 GMT
Author: gsingers
Date: Thu Jun 14 05:09:02 2007
New Revision: 547226

Lucene 925: analysis javadocs


Modified: lucene/java/trunk/CHANGES.txt
--- lucene/java/trunk/CHANGES.txt (original)
+++ lucene/java/trunk/CHANGES.txt Thu Jun 14 05:09:02 2007
@@ -276,6 +276,8 @@
  4. LUCENE-740: Added SNOWBALL-LICENSE.txt to the snowball package and a
     remark about the license to NOTICE.TXT. (Steven Parkes via Michael Busch)
+ 5. LUCENE-925: Added analysis package javadocs. (Grant Ingersoll and Doron Cohen)
  1. LUCENE-802: Added LICENSE.TXT and NOTICE.TXT to Lucene jars.

Modified: lucene/java/trunk/src/java/org/apache/lucene/analysis/package.html
--- lucene/java/trunk/src/java/org/apache/lucene/analysis/package.html (original)
+++ lucene/java/trunk/src/java/org/apache/lucene/analysis/package.html Thu Jun 14 05:09:02
@@ -5,6 +5,90 @@
    <meta name="Author" content="Doug Cutting">
-API and code to convert text into indexable tokens.
+<p>API and code to convert text into indexable/searchable tokens.  Covers {@link org.apache.lucene.analysis.Analyzer}
and related classes.</p>
+<h2>Parsing? Tokenization? Analysis!</h2>
+Lucene, an indexing and search library, accepts only plain text input.
+Applications that build their search capabilities upon Lucene may support documents in various
formats - HTML, XML, PDF, Word - just to name a few.
+Lucene does not care about the <i>Parsing</i> of these and other document formats,
and it is the responsibility of the 
+application using Lucene to use an appropriate <i>Parser</i> to convert the original
format into plain text, before passing that plain text to Lucene.
+Plain text passed to Lucene for indexing goes through a process generally called tokenization
- namely breaking of the 
+input text into small indexing elements - <i>Tokens</i>. The way that the input
text is broken into tokens very 
+much dictates the further search capabilities of the index into which that text was added.
+For instance, sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches
+(though sentence identification is not provided by Lucene).
+In some cases simply breaking the input text into tokens is not enough - a deeper <i>Analysis</i>
is needed,
+providing for several functions, including (but not limited to):
+<ul>
+  <li>Stemming -- Replacing words with their stems. For instance, with English stemming "bikes" is replaced by "bike"; now a query for "bike" can find both documents containing "bike" and those containing "bikes". See <a href="">Wikipedia</a> for more information.</li>
+  <li>Stop-word removal -- Common words like "the", "and" and "a" rarely add any value to a search.  Removing them shrinks the index size and increases performance.</li>
+  <li>Character normalization -- Stripping accents and other character markings can make for better searching.</li>
+  <li>Synonym expansion -- Adding synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.</li>
+</ul>
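The stemming and stop-word steps above can be sketched in plain Java. This is a hypothetical illustration (all names invented here); in Lucene itself these functions are performed by TokenFilter implementations, and the "stemmer" below is deliberately naive compared to a real algorithm such as Porter or Snowball:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical plain-Java sketch of stemming and stop-word removal.
public class AnalysisSketch {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "and", "a"));

    // Naive suffix-stripping "stemmer": maps "bikes" and "bike" to the same term.
    static String stem(String word) {
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(word)) continue; // stop-word removal
            terms.add(stem(word));                   // stemming
        }
        return terms;
    }
}
```

With this sketch, indexing "The bikes" and querying "bike" would both produce the term "bike", which is exactly why a query can match either surface form.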
+<h2>Core Analysis</h2>
+  The analysis package provides the mechanism to convert Strings and Readers into tokens
that can be indexed by Lucene.  There
+  are three main classes in the package from which all analysis processes are derived.  These are:
+  <ul>
+    <li>{@link org.apache.lucene.analysis.Analyzer} -- An Analyzer is responsible for
building a TokenStream which can be consumed
+    by the indexing and searching processes.  See below for more information on implementing
your own Analyzer.</li>
+    <li>{@link org.apache.lucene.analysis.Tokenizer} -- A Tokenizer is a {@link org.apache.lucene.analysis.TokenStream}
and is responsible for breaking
+    up incoming text into {@link org.apache.lucene.analysis.Token}s.  In most cases, an Analyzer
will use a Tokenizer as the first step in
+    the analysis process.</li>
+    <li>{@link org.apache.lucene.analysis.TokenFilter} -- A TokenFilter is also a {@link
org.apache.lucene.analysis.TokenStream} and is responsible
+    for modifying {@link org.apache.lucene.analysis.Token}s that have been created by the
Tokenizer.  Common modifications performed by a
+    TokenFilter are: deletion, stemming, synonym injection, and down casing.  Not all Analyzers
require TokenFilters.</li>
+  </ul>
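The relationship between the three classes is a decorator chain: a Tokenizer is a TokenStream that creates tokens from raw text, a TokenFilter is a TokenStream that wraps another stream and modifies its tokens, and an Analyzer builds the chain. A hypothetical plain-Java sketch of that shape (the names mirror the Lucene classes, but the interface here is invented for illustration; Lucene's own TokenStream deals in Token objects, not Strings):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Locale;

// Illustrative sketch of the Analyzer / Tokenizer / TokenFilter structure.
public class PipelineSketch {
    interface TokenStream { String next(); } // returns null when exhausted

    // A Tokenizer is a TokenStream that creates tokens from raw text.
    static class WhitespaceTokenizer implements TokenStream {
        private final Iterator<String> it;
        WhitespaceTokenizer(String text) { it = Arrays.asList(text.split("\\s+")).iterator(); }
        public String next() { return it.hasNext() ? it.next() : null; }
    }

    // A TokenFilter is also a TokenStream; it decorates another stream.
    static class LowerCaseFilter implements TokenStream {
        private final TokenStream input;
        LowerCaseFilter(TokenStream input) { this.input = input; }
        public String next() {
            String t = input.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        }
    }

    // An Analyzer wires the whole chain together for the consumer.
    static TokenStream analyzer(String text) {
        return new LowerCaseFilter(new WhitespaceTokenizer(text));
    }
}
```

Because a filter wraps any TokenStream, filters can be stacked in any order the Analyzer chooses, which is the whole point of the design.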
+<h2>Hints, Tips and Traps</h2>
+   The relationship between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer}
+   is sometimes confusing. To clear up the confusion, some clarifications:
+   <ul>
+      <li>The {@link org.apache.lucene.analysis.Analyzer} is responsible for the entire
task of 
+          <u>creating</u> tokens out of the input text, while the {@link org.apache.lucene.analysis.Tokenizer}
+          is only responsible for <u>breaking</u> the input text into tokens.
Very likely, tokens created 
+          by the {@link org.apache.lucene.analysis.Tokenizer} would be modified or even omitted

+          by the {@link org.apache.lucene.analysis.Analyzer} before being returned.
+       </li>
+       <li>{@link org.apache.lucene.analysis.Tokenizer} is a {@link org.apache.lucene.analysis.TokenStream},

+           but {@link org.apache.lucene.analysis.Analyzer} is not.
+       </li>
+       <li>{@link org.apache.lucene.analysis.Analyzer} is "field aware", but 
+           {@link org.apache.lucene.analysis.Tokenizer} is not.
+       </li>
+   </ul>
+<p>Lucene Java provides a number of analysis capabilities, the most commonly used one
being the {@link
+  org.apache.lucene.analysis.standard.StandardAnalyzer}.  Many applications will have a long
and industrious life with nothing more
+  than the StandardAnalyzer.  However, there are a few other classes/packages that are worth mentioning:
+  <ol>
+    <li>{@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} -- Most Analyzers
perform the same operation on all
+      {@link org.apache.lucene.document.Field}s.  The PerFieldAnalyzerWrapper can be used
to associate a different Analyzer with different
+      {@link org.apache.lucene.document.Field}s.</li>
+    <li>The contrib/analyzers library located at the root of the Lucene distribution
has a number of different Analyzer implementations to solve a variety
+    of different problems related to searching.  Many of the Analyzers are designed to analyze
non-English languages.</li>
+    <li>The contrib/snowball library located at the root of the Lucene distribution
has Analyzer and TokenFilter implementations for a variety of Snowball stemmers.  See <a
href=""></a> for more information.</li>
+    <li>There are a variety of Tokenizer and TokenFilter implementations in this package.
 Take a look around, chances are someone has implemented what you need.</li>
+  </ol>
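The PerFieldAnalyzerWrapper idea from the list above amounts to dispatching on the field name: one default analysis function, with overrides registered per field. A hypothetical plain-Java sketch of that dispatch (names invented; the real wrapper works with Analyzer instances rather than functions):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of per-field analyzer dispatch.
public class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a different analysis function for one named field.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Fall back to the default when no override is registered.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }
}
```

This is also a concrete illustration of the earlier point that analysis is "field aware": the field name, not just the text, decides which pipeline runs.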
+<p>Analysis is one of the main causes of performance degradation during indexing. 
Simply put, the more you analyze, the slower the indexing (in most cases).
+  Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer}
combined with a
+  {@link org.apache.lucene.analysis.StopFilter}.</p>
+<h2>Implementing your own Analyzer</h2>
+<p>Creating your own Analyzer is straightforward. It usually involves either wrapping
an existing Tokenizer and a set of TokenFilters to create a new Analyzer
+or creating both the Analyzer and a Tokenizer or TokenFilter.  Before pursuing this approach,
you may find it worthwhile
+to explore the contrib/analyzers library and/or ask on the mailing
list first to see if what you need already exists.
+If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer
or TokenFilter), have a look at
+the source code of any one of the many samples located in this package.</p>
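As a rough picture of what "wrapping an existing Tokenizer and a set of TokenFilters" amounts to, here is a hypothetical self-contained sketch in plain Java (invented names, compressed into one method; a real Analyzer would return a TokenStream built from the actual Lucene Tokenizer and TokenFilter classes):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch of a custom analyzer: tokenize, then apply filters in order.
public class CustomAnalyzerSketch {
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "a", "and"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split("\\s+"))); // tokenizer step
        tokens.replaceAll(t -> t.toLowerCase(Locale.ROOT));                       // lower-case filter
        tokens.removeIf(STOP::contains);                                          // stop filter
        return tokens;
    }
}
```

The order of the filter steps matters: lower-casing before stop-word removal is what lets "The" match the lowercase stop list.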
