lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Need an analyzer that includes numbers.
Date Sun, 26 Dec 2004 09:06:41 GMT

On Dec 25, 2004, at 11:05 AM, Jim wrote:

> I've seen some discussion on this and the answer seems to be "write 
> your own".  Hasn't someone already done that by now that would share?  
> I really have to be able to include numeric and alphanumeric strings 
> in my searches.   I don't understand analyzers well enough to roll my 
> own.

This is more involved than just keeping numbers around... or at least 
there are more steps to consider.  Do you want the alpha characters 
lower-cased?  That is the typical behavior, so that searches are 
case-insensitive.  What about punctuation characters?  Generally these 
get tossed, though there are cases where that is not desired either.
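
For instance, here is a rough sketch (just an illustration, with a 
made-up class name) of what the stock SimpleAnalyzer does to a date 
string - it keeps letters only and lower-cases them, so the numbers get 
tossed along with the punctuation, which is exactly what you're trying 
to avoid:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Throwaway demo class - the name is arbitrary.
public class SimpleAnalyzerDemo {
  public static void main(String[] args) throws IOException {
    TokenStream ts =
        new SimpleAnalyzer().tokenStream("field",
            new StringReader("December 26, 2004"));

    // SimpleAnalyzer tokenizes on letters only, so "26" and "2004"
    // never come out of the stream - only "december" does.
    for (Token t = ts.next(); t != null; t = ts.next()) {
      System.out.println(t.termText());
    }
  }
}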

The good news is that writing the Tokenizer and TokenFilter pieces of an 
analyzer is generally relatively easy, and there are a number of 
built-in Lucene pieces that you can leverage.  I whipped up a quick 
AlphanumericAnalyzer for you, demonstrating a CharTokenizer subclass 
that treats alphanumeric characters as part of tokens and any other 
character as a separator that gets thrown away.  At the same time, it 
lowercases.  The output of the main() method is shown below as well.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class AlphanumericAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Anonymous CharTokenizer subclass: letters and digits are token
    // characters; everything else is a separator that gets discarded.
    return new CharTokenizer(reader) {
      protected char normalize(char c) {
        // Lower-case each token character as it is read.
        return Character.toLowerCase(c);
      }

      protected boolean isTokenChar(char c) {
        return Character.isLetter(c) || Character.isDigit(c);
      }
    };
  }

  public static void main(String[] args) throws IOException {
    TokenStream ts =
        new AlphanumericAnalyzer().tokenStream("field",
            new StringReader("December 26, 2004"));

    // Pull the three tokens straight off the stream.
    String month = ts.next().termText();
    String day = ts.next().termText();
    String year = ts.next().termText();

    System.out.println(month + " " + day + " " + year);
  }
}


Output:
december 26 2004

Calling .tokenStream and .next().termText() is not something your 
production code would need to do - but it's what happens under the 
covers of Lucene.  If you are going to write a custom analyzer, you 
*should* write unit tests that "analyze" the analyzer using these 
lower-level methods.
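
Something along these lines would do it - a minimal JUnit sketch, where 
the test class and method names are just made up here:

import java.io.IOException;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// A minimal sketch of such a test - class and method names are arbitrary.
public class AlphanumericAnalyzerTest extends TestCase {
  public void testKeepsNumbersAndLowercases() throws IOException {
    TokenStream ts =
        new AlphanumericAnalyzer().tokenStream("field",
            new StringReader("December 26, 2004"));

    // Walk the stream token by token and assert on the exact terms.
    assertToken(ts, "december");
    assertToken(ts, "26");
    assertToken(ts, "2004");
    assertNull("no more tokens expected", ts.next());
  }

  private void assertToken(TokenStream ts, String expected)
      throws IOException {
    Token token = ts.next();
    assertNotNull(expected + " expected", token);
    assertEquals(expected, token.termText());
  }
}

In production you would simply hand the analyzer to IndexWriter at 
indexing time and to QueryParser at search time, and never touch the 
TokenStream directly.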

Lucene in Action goes into the analysis topic deeply, but accessibly, 
and I spent a great deal of time toying with different customizations 
to analyzers in order to write about them.  The sample code 
distribution includes utility methods and unit-test helpers to 
illustrate, test, and debug the analysis process.  In retrospect, the 
example I cobbled together to reply to this e-mail would have been a 
good one to add as well.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

