lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject NullTokenizer
Date Tue, 22 Jan 2002 16:27:48 GMT
> From: briangoetz@apache.org [mailto:briangoetz@apache.org]
>
>   Added:       src/java/org/apache/lucene/analysis NullAnalyzer.java
>                         NullTokenizer.java
>   Log:
>   added NullTokenizer/NullAnalyzer which just pass through
>   space-separated tokens unmodified (mostly for testing purposes)

NullTokenizer is almost exactly like LetterTokenizer, except that instead of
checking for Character.isLetter it checks for !Character.isWhitespace.
Perhaps we should make both of these subclasses of a common base class, with
a protected isTokenChar method that each implements?  It's a shame to have
so much code duplication.

We also have LowerCaseTokenizer, which you wrote and which is almost the
same code again.  Maybe the base class could also have a normalizeCharacter
method that in LetterTokenizer does nothing but in LowerCaseTokenizer calls
Character.toLowerCase.
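
A minimal sketch of what such a base class might look like. The class names
here (SketchCharTokenizer and friends) are hypothetical, chosen so as not to
be mistaken for the real Lucene classes, and next() returns a plain String
rather than a Lucene Token to keep the example self-contained. The two hooks,
isTokenChar and normalizeCharacter, are the ones proposed above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed base class; not Lucene's actual API.
abstract class SketchCharTokenizer {
    private final Reader input;

    SketchCharTokenizer(Reader input) {
        this.input = input;
    }

    /** Returns true if c belongs inside a token. */
    protected abstract boolean isTokenChar(char c);

    /** Per-character normalization hook; the default leaves c unchanged. */
    protected char normalizeCharacter(char c) {
        return c;
    }

    /** Returns the next token, or null at end of input. */
    public final String next() throws IOException {
        StringBuilder buf = new StringBuilder();
        int ch;
        while ((ch = input.read()) != -1) {
            char c = (char) ch;
            if (isTokenChar(c)) {
                buf.append(normalizeCharacter(c));
            } else if (buf.length() > 0) {
                break; // a non-token character ends the current token
            }
            // otherwise: skip non-token characters before a token starts
        }
        return buf.length() > 0 ? buf.toString() : null;
    }
}

// Tokens are maximal runs of letters.
class SketchLetterTokenizer extends SketchCharTokenizer {
    SketchLetterTokenizer(Reader input) { super(input); }

    @Override
    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }
}

// Same token boundaries as the letter tokenizer, but lowercases each char.
class SketchLowerCaseTokenizer extends SketchLetterTokenizer {
    SketchLowerCaseTokenizer(Reader input) { super(input); }

    @Override
    protected char normalizeCharacter(char c) {
        return Character.toLowerCase(c);
    }
}

// Tokens are maximal runs of non-whitespace characters.
class SketchWhitespaceTokenizer extends SketchCharTokenizer {
    SketchWhitespaceTokenizer(Reader input) { super(input); }

    @Override
    protected boolean isTokenChar(char c) {
        return !Character.isWhitespace(c);
    }
}

// Small helper (also hypothetical) that drains a tokenizer into a list.
final class SketchTokens {
    private SketchTokens() {}

    static List<String> all(SketchCharTokenizer tokenizer) {
        List<String> out = new ArrayList<>();
        try {
            for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
                out.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With this shape, each concrete tokenizer reduces to a one-method override,
and the scanning loop lives in exactly one place.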

NullTokenizer and NullAnalyzer are also not very descriptive names.  I would
prefer WhitespaceTokenizer and WhitespaceAnalyzer.  But if these are really
only used by the test code, and the above base-class strategy were
implemented, then these could just become anonymous classes like:
  Analyzer analyzer = new Analyzer() {
    public TokenStream tokenStream(Reader reader) {
      return new CharTokenizer(reader) {
        protected boolean isTokenChar(char c) {
          return !Character.isWhitespace(c);
        }
      };
    }
  };
That way org.apache.lucene.analysis wouldn't be cluttered by classes not of
general interest.

Do you agree with this proposal?  If so, would you like to implement it, or
shall I?

Doug
