lucene-java-commits mailing list archives

From: Apache Wiki <wikidi...@apache.org>
Subject: [Lucene-java Wiki] Update of "LuceneFAQ" by SteveRowe
Date: Wed, 28 Dec 2011 03:22:28 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The "LuceneFAQ" page has been changed by SteveRowe:
http://wiki.apache.org/lucene-java/LuceneFAQ?action=diff&rev1=157&rev2=158

Comment:
modernize custom analyzer examples

  Here is an example:
  
  {{{
- public class MyAnalyzer extends Analyzer
- {
-     private static final Analyzer STANDARD = new StandardAnalyzer();
- 
-     public TokenStream tokenStream(String field, final Reader reader)
-     {
-         // do not tokenize field called 'element'
-         if ("element".equals(field)) {
-             return new CharTokenizer(reader) {
-                 protected boolean isTokenChar(char c) {
-                     return true;
-                 }
-             };
-         } else {
-             // use standard analyzer
-             return STANDARD.tokenStream(field, reader);
-         }
-     }
+ public class MyAnalyzer extends ReusableAnalyzerBase {
+   private Version matchVersion;
+ 
+   public MyAnalyzer(Version matchVersion) {
+     this.matchVersion = matchVersion;
+   }
+ 
+   @Override
+   protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+     final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
+     TokenStream sink = new LowerCaseFilter(matchVersion, source);
+     sink = new LengthFilter(sink, 3, Integer.MAX_VALUE);
+     return new TokenStreamComponents(source, sink);
+   }
  }
  }}}
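  A minimal usage sketch, not taken from the FAQ itself: an analyzer like the MyAnalyzer above is simply passed to an IndexWriterConfig when opening an IndexWriter. The Version constant, the class name MyAnalyzerUsage, and the in-memory RAMDirectory below are assumptions made for the sketch.
  
  {{{
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MyAnalyzerUsage {
  public static void main(String[] args) throws IOException {
    // Assumed setup for this sketch: Lucene 3.5 and an in-memory directory.
    Directory dir = new RAMDirectory();
    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_35, new MyAnalyzer(Version.LUCENE_35));
    IndexWriter writer = new IndexWriter(dir, config);
    // ... add documents here ...
    writer.close();
  }
}
  }}}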
  All that being said, most of the heavy lifting in custom analyzers is done by custom subclasses of TokenFilter; a sketch of one such filter follows.
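  A minimal sketch of such a filter, written against the 3.x attribute-based TokenStream API; the class name VowelStripFilter and what it does are invented for illustration only:
  
  {{{
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical example filter: strips vowels from each term it sees.
public final class VowelStripFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public VowelStripFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;  // no more tokens from the wrapped stream
    }
    String stripped = termAtt.toString().replaceAll("[aeiouAEIOU]", "");
    termAtt.setEmpty().append(stripped);
    return true;
  }
}
  }}}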
@@ -428, +424 @@

  If you want your custom token modification to come after the filters that Lucene's StandardAnalyzer class would normally call, do the following:
  
  {{{
- return new NameFilter(
-         CaseNumberFilter(
-                 new StopFilter(
-                         new LowerCaseFilter(
-                                 new StandardFilter(
-                                         new StandardTokenizer(reader)
-                         )
-                 ), StopAnalyzer.ENGLISH_STOP_WORDS)
-         )
- );
+   @Override
+   protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+     final Tokenizer source = new StandardTokenizer(matchVersion, reader);
+     TokenStream sink = new StandardFilter(matchVersion, source);
+     sink = new LowerCaseFilter(matchVersion, sink);
+     sink = new StopFilter(matchVersion, sink,
+                           StopAnalyzer.ENGLISH_STOP_WORDS_SET, false);
+     sink = new CaseNumberFilter(sink);
+     sink = new NameFilter(sink);
+     return new TokenStreamComponents(source, sink);
+   }
  }}}
  ==== How do I index non-Latin characters? ====
  Lucene only uses Java strings, so you normally do not need to care about this. Just remember
that you may need to specify an encoding when you read in external strings from, for example,
a file (otherwise the system's default encoding will be used). If you really need to recode a
String you can use this hack:
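  One common form of such a recode, shown here only as a sketch (the charset names are assumptions, and the exact snippet on the wiki may differ):
  
  {{{
import java.io.UnsupportedEncodingException;

public class Recode {
  // Re-interpret a String whose bytes were decoded with the wrong charset:
  // recover the original bytes via the charset that was (wrongly) used, then
  // decode them with the intended one. Charset names are assumptions here.
  public static String recode(String broken) throws UnsupportedEncodingException {
    return new String(broken.getBytes("ISO-8859-1"), "UTF-8");
  }
}
  }}}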
