lucene-dev mailing list archives

From Doug Cutting <>
Subject NullTokenizer
Date Tue, 22 Jan 2002 16:27:48 GMT
> From: []
>   Added:       src/java/org/apache/lucene/analysis
>   Log:
>   added NullTokenizer/NullAnalyzer which just 
> pass through space-separated tokens unmodified (mostly for 
> testing purposes)

NullTokenizer is almost exactly like LetterTokenizer, except that instead of
checking for Character.isLetter it checks for !Character.isWhitespace.
Perhaps we should make both of these subclasses of a common base class, with
a protected isTokenChar method that each implements?  It's a shame to have
so much code duplication.
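
As a rough sketch of that idea (not Lucene's actual Tokenizer API: the class
names below are hypothetical and tokens are returned as plain Strings to keep
the example self-contained), the shared scanning loop could live in one
abstract class, with each subclass supplying only isTokenChar:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified base class. It assumes a tokenizer only needs to
// hand back the next run of "token characters" as a String.
abstract class CharTokenizerSketch {
    private final Reader input;

    CharTokenizerSketch(Reader input) {
        this.input = input;
    }

    // The single hook each subclass implements.
    protected abstract boolean isTokenChar(char c);

    // Returns the next maximal run of token characters, or null at end of input.
    public String next() throws IOException {
        StringBuilder token = new StringBuilder();
        int ch;
        while ((ch = input.read()) != -1) {
            if (isTokenChar((char) ch)) {
                token.append((char) ch);
            } else if (token.length() > 0) {
                break; // a non-token character ends the current token
            }
            // otherwise we are still skipping separators before the token starts
        }
        return token.length() > 0 ? token.toString() : null;
    }
}

// Tokens are runs of letters.
class LetterTokenizerSketch extends CharTokenizerSketch {
    LetterTokenizerSketch(Reader input) { super(input); }

    @Override
    protected boolean isTokenChar(char c) { return Character.isLetter(c); }
}

// Tokens are runs of non-whitespace characters.
class WhitespaceTokenizerSketch extends CharTokenizerSketch {
    WhitespaceTokenizerSketch(Reader input) { super(input); }

    @Override
    protected boolean isTokenChar(char c) { return !Character.isWhitespace(c); }
}

// Small helper for trying the sketch out: drains a tokenizer into a list.
class TokenizerSketchDemo {
    static List<String> tokens(CharTokenizerSketch tokenizer) {
        List<String> result = new ArrayList<>();
        try {
            for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
                result.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

With this shape the two tokenizers differ by a single one-line method, which
is where the duplication goes away.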

We also have LowerCaseTokenizer, which you wrote and which is almost the
same code again.  Maybe the base class could also have a normalizeCharacter
method that in LetterTokenizer does nothing but in LowerCaseTokenizer calls

NullTokenizer and NullAnalyzer are also not very descriptive names.  I would
prefer WhitespaceTokenizer and WhitespaceAnalyzer.  But if these are really
only used by the test code, and the above base-class strategy were
implemented, then these could just become anonymous classes like:
  Analyzer analyzer = new Analyzer() {
    public TokenStream tokenStream(Reader reader) {
      return new CharTokenizer(reader) {
        protected boolean isTokenChar(char c) {
          return !Character.isWhitespace(c);
        }
      };
    }
  };

That way org.apache.lucene.analysis wouldn't be cluttered by classes not of
general interest.

Do you agree with this proposal?  If so, would you like to implement it, or
shall I?

