lucene-java-user mailing list archives

From Graham Sugden <gras...@gmail.com>
Subject Re: Analysis
Date Mon, 22 Aug 2011 11:28:29 GMT
A caveat to the below: I am very new to Lucene. (That said, by following the
strategy below, after a couple of days' work I have a set of per-field
analyzers for various languages, using various custom filters and caching of
the initial analysis, and capable of outputting stemmed, reversed, and
diacritic/accent-less content. That is a lot more than I expected when I
started out. Hat tip to all the Lucene developers!)

I found that this page

http://www.java2s.com/Open-Source/Java-Document/Search-Engine/lucene/org.apache.lucene.analysis.htm

together with reading the code of the current implementations (the analysis
package and the contrib analyzers), was a good way to get up and running
fairly quickly:

[1]
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/

[2]
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/standard/

[3]
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/analyzers/common/src/java/org/apache/lucene/analysis/

The code and documentation for StandardAnalyzer [2] (see its createComponents
override) and its class hierarchy are where I started: StopwordAnalyzerBase
[1] extends ReusableAnalyzerBase [1], which extends Analyzer [1]. Generally,
creating your own analyzer is a matter of overriding the tokenStream and
reusableTokenStream methods, either directly or, if ReusableAnalyzerBase is
in the hierarchy (it usually is if you are using any of the language
analyzers), indirectly by overriding the createComponents method. The idiom
is then usually

{
  src = new SomeTokenizer(..., reader, ...);
  tokenStream = new SomeFilter(..., src, ...);
  tokenStream = new AnotherFilter(..., tokenStream, ...);
  ...
  tokenStream = new YetAnotherFilter(..., tokenStream, ...);
  // if overriding createComponents:
  return new TokenStreamComponents(src, tokenStream);
  // otherwise:
  // return tokenStream;
}
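
To make the wrapping idiom concrete, here is a minimal stand-alone sketch in
plain Java. It deliberately does not use the Lucene API: TokenSource,
WhitespaceSource, LowerCase, Stop and createChain are all hypothetical
stand-ins for a tokenizer and two filters, and the stop-word list is made up.
It only shows the shape of the chain, where each filter wraps the stream
built so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stand-alone model of the tokenizer -> filter -> filter idiom.
// None of these types are Lucene's; all names are hypothetical.
class AnalysisChainSketch {

    /** Minimal stream contract: the next token, or null when exhausted. */
    interface TokenSource {
        String next();
    }

    /** Stand-in for a tokenizer: splits the raw text on whitespace. */
    static final class WhitespaceSource implements TokenSource {
        private final String[] parts;
        private int pos = 0;

        WhitespaceSource(String text) {
            String trimmed = text.trim();
            parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        public String next() {
            return pos < parts.length ? parts[pos++] : null;
        }
    }

    /** Stand-in for a filter: lower-cases every token of the wrapped stream. */
    static final class LowerCase implements TokenSource {
        private final TokenSource in;

        LowerCase(TokenSource in) {
            this.in = in;
        }

        public String next() {
            String t = in.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        }
    }

    /** Another filter: silently drops tokens found in the stop set. */
    static final class Stop implements TokenSource {
        private final TokenSource in;
        private final Set<String> stopWords;

        Stop(TokenSource in, Set<String> stopWords) {
            this.in = in;
            this.stopWords = stopWords;
        }

        public String next() {
            String t;
            while ((t = in.next()) != null) {
                if (!stopWords.contains(t)) {
                    return t;
                }
            }
            return null;
        }
    }

    /** Mirrors the idiom: build the source, then wrap it filter by filter. */
    static TokenSource createChain(String text) {
        TokenSource src = new WhitespaceSource(text);
        TokenSource tokenStream = new LowerCase(src);
        tokenStream = new Stop(tokenStream, new HashSet<>(Arrays.asList("the", "a", "of")));
        return tokenStream;
    }

    /** Drains the chain into a list so the result is easy to inspect. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        TokenSource stream = createChain(text);
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, north]
        System.out.println(analyze("The Quick Brown Fox of the North"));
    }
}
```

Note that the order of wrapping matters: lower-casing runs before the stop
filter here, which is why "The" is matched against the lower-case stop set.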

For filters and attribute use (see the tutorial link above), I found
LowerCaseFilter [1] (for its use of CharTermAttribute), FilteringTokenFilter
[1] (for its use of PositionIncrementAttribute), and SynonymFilter
[3 (synonyms/)] helpful.
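
As a rough picture of what the two attributes carry, the sketch below models
a token-removing filter in plain Java. It is not Lucene code: the Token class
and dropShortTerms method are hypothetical, with one field standing in for
the term text and one for the position increment. The point it illustrates is
that when a filter drops tokens, the next surviving token gets a larger
position increment, so the gap remains visible to whatever consumes the
stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone model of a token-removing filter; not Lucene's classes.
class PositionIncrementSketch {

    /** A token with the two pieces of state discussed above. */
    static final class Token {
        final String term;            // plays the role of the term attribute
        final int positionIncrement;  // plays the role of the position increment attribute

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Drops terms shorter than minLength. Each dropped term adds one to the
     * increment of the next kept term, so the gap stays visible downstream.
     */
    static List<Token> dropShortTerms(List<String> terms, int minLength) {
        List<Token> kept = new ArrayList<>();
        int skipped = 0;
        for (String term : terms) {
            if (term.length() >= minLength) {
                kept.add(new Token(term, 1 + skipped));
                skipped = 0;
            } else {
                skipped++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "quick", "fox", "is", "in", "town");
        // prints [quick(+2), fox(+1), town(+3)]
        System.out.println(dropShortTerms(terms, 3));
    }
}
```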

Other classes I have found useful to know at this stage are:
  PerFieldAnalyzerWrapper (derives from Analyzer, so overrides tokenStream &
    reusableTokenStream): useful for applying different analysis to
    individual fields of a single document.
  CachingTokenFilter & TeeSinkTokenFilter: useful for avoiding duplication of
    (expensive) analysis where fields share a common initial analysis.
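
The two ideas in this list (per-field dispatch, and running a shared initial
analysis only once) can be sketched together in plain Java. Again this is a
model and not the Lucene API: PerFieldDispatcher, baseAnalysis, reversed and
the field names "body" and "body_reversed" are all made up for illustration,
and the run counter exists only to show that the shared step is not repeated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Stand-alone model; none of these types are Lucene's.
class PerFieldSketch {

    /** Counts how often the (supposedly expensive) shared step runs. */
    static int baseAnalysisRuns = 0;

    /** Stand-in for an expensive initial analysis shared by several fields. */
    static List<String> baseAnalysis(String text) {
        baseAnalysisRuns++;
        String t = text.trim().toLowerCase(Locale.ROOT);
        if (t.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(t.split("\\s+"));
    }

    /** Field-specific finishing step: reverse every token. */
    static List<String> reversed(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(new StringBuilder(t).reverse().toString());
        }
        return out;
    }

    /** Picks the finishing step by field name, falling back to a default. */
    static final class PerFieldDispatcher {
        private final Function<List<String>, List<String>> defaultStep;
        private final Map<String, Function<List<String>, List<String>>> perField = new HashMap<>();

        PerFieldDispatcher(Function<List<String>, List<String>> defaultStep) {
            this.defaultStep = defaultStep;
        }

        void add(String field, Function<List<String>, List<String>> step) {
            perField.put(field, step);
        }

        List<String> analyze(String field, List<String> cachedBase) {
            return perField.getOrDefault(field, defaultStep).apply(cachedBase);
        }
    }

    /** Runs the shared analysis once, then replays it for each field. */
    static Map<String, List<String>> analyzeDocument(String text) {
        PerFieldDispatcher dispatcher = new PerFieldDispatcher(tokens -> tokens);
        dispatcher.add("body_reversed", PerFieldSketch::reversed);

        List<String> cached = baseAnalysis(text); // computed once, reused below
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("body", dispatcher.analyze("body", cached));
        out.put("body_reversed", dispatcher.analyze("body_reversed", cached));
        return out;
    }

    public static void main(String[] args) {
        // prints {body=[hello, world], body_reversed=[olleh, dlrow]}
        System.out.println(analyzeDocument("Hello World"));
    }
}
```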

So far I have found that the currently available tokenizers meet my needs, so
I have not looked at implementing my own yet. The code base is probably as
good a place as any to start for that too, after which I would guess that
parsing the input stream becomes the complicated bit. Maybe someone else can
chip in on that?

Hope this helps. Kind regards, graham


On Mon, Aug 22, 2011 at 8:10 AM, Saar Carmi <saarcarmi@gmail.com> wrote:

> Hi
> Where can I find a  guide for building analyzers, filters and tokenizers?
>
> Saar
>
