lucene-java-user mailing list archives

From Paul Taylor <>
Subject How do you see if a tokenstream has tokens without consuming the tokens ?
Date Mon, 17 Oct 2011 12:12:59 GMT

We have a modified version of Lucene's StandardAnalyzer that we use for 
tokenizing music metadata such as artist names and song titles, so 
typically only a few words. When tokenizing, it usually strips out 
punctuation, which is correct. However, if the input text consists of 
only punctuation characters then we end up with no tokens at all; for 
these particular RARE cases I want to use a mapping filter.

So what I try to do is have my analyzer tokenize as normal and then, if 
the result is no tokens, retokenize with the mapping filter. I check 
whether there are any tokens using incrementToken(), but then I can't 
see how to undo that call, i.e. some kind of decrementToken(). How can I 
do this, or is there a more efficient way of doing it? Note that of 
maybe 10,000,000 records only a few hundred will have this problem, so I 
need a solution that doesn't impact performance.

     NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
     specialcharConvertMap.add("!", "Exclamation");

     public  TokenStream tokenStream(String fieldName, Reader reader) {
         CharFilter specialCharFilter = new 

         StandardTokenizer tokenStream = new 
                 tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
                 //TODO **************** set tokenstream back as it was before increment token
         catch(IOException ioe)

         TokenStream result = new LowercaseFilter(result);
         return result;
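
To make the idea concrete, here is a minimal sketch in plain Java with no Lucene classes at all. Everything in it is hypothetical: PeekingTokens, FallbackAnalysis, tokenize() and mapSpecialChars() are stand-ins for the real tokenizer and mapping filter, and the input is assumed to be available as a String so that it can be read a second time in the rare fallback case. It only shows the shape of "check for a first token, keep it buffered, and retokenize with the mapping only when nothing came out".

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;

/** One-token lookahead: hasNext() buffers the next token instead of discarding it. */
final class PeekingTokens implements Iterator<String> {
    private final Iterator<String> source;
    private String buffered;      // token already pulled from source, not yet returned
    private boolean hasBuffered;

    PeekingTokens(Iterator<String> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (!hasBuffered && source.hasNext()) {
            buffered = source.next();
            hasBuffered = true;
        }
        return hasBuffered;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        hasBuffered = false;
        String token = buffered;
        buffered = null;
        return token;
    }
}

/** Hypothetical stand-in for the analyzer; not Lucene API. */
final class FallbackAnalysis {
    /** Stand-in tokenizer: lowercases and drops everything that is not a letter or digit. */
    static Iterator<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens.iterator();
    }

    /** Stand-in for the mapping filter, using the same "!" mapping as above. */
    static String mapSpecialChars(String text) {
        return text.replace("!", "Exclamation");
    }

    /** Tokenize normally; only if no token appears, retokenize the mapped text. */
    static Iterator<String> analyze(String text) {
        PeekingTokens normal = new PeekingTokens(tokenize(text));
        if (normal.hasNext()) {
            return normal;                        // common case: the peeked token is still there
        }
        return tokenize(mapSpecialChars(text));   // rare case: input was punctuation only
    }

    /** Small helper to drain an iterator into a list for inspection. */
    static List<String> toList(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        while (tokens.hasNext()) {
            out.add(tokens.next());
        }
        return out;
    }
}
```

In this sketch the common case costs one buffered token and no second pass, and the mapping work is only done for the few inputs that produce nothing. Whether the same wrapping can be applied directly to a Lucene TokenStream, or whether something like Lucene's CachingTokenFilter is a better fit, is not something this sketch verifies.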

thanks for any help

