lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pierrick Brihaye <pierrick.brih...@culture.gouv.fr>
Subject Re: derive tokens from single token
Date Mon, 29 Sep 2003 13:46:17 GMT
Hi,

Hackl, Rene a écrit:

> I tried to extend TokenFilter, but 
> all I get is either "oobar" or "obar", depends on when 'return' is called. 
> 
> How could I add such extra tokens to the tokenStream? Any thoughts on this
> appreciated.

Adapted from my... arabic Analyzer :

public class ChemicalOrSomethingFilter extends TokenFilter {
	
   private Token receivedToken = null;
   private StringBuffer receivedText = new StringBuffer();	
	
   public ChemicalOrSomethingFilter(TokenStream input){
     super(input);
   }
	
   private String getNextTruncation() {		
     StringBuffer emittedText = new StringBuffer();
     //left trim the token
     while(true) {
       if (receivedText.length() == 0) break;
       char c = receivedText.charAt(0);
       if (!Character.isWhitespace(c)) break;
       receivedText.deleteCharAt(0);		
     }				
     //keep the good stuff
     while(true) {
       if (receivedText.length() == 0) break;
       char c = receivedText.charAt(0);
       if (Character.isWhitespace(c)) break;
       emittedText.append(receivedText.charAt(0));
       receivedText.deleteCharAt(0);			
     }		
     //right trim the token
     while(true) {
      if (receivedText.length() == 0) break;
      char c = receivedText.charAt(0);
      if (!Character.isWhitespace(c)) break;
      receivedText.deleteCharAt(0);		
     }				
     return emittedText.toString();
   }

   public final Token next() throws IOException {
     while (true) {
       String emittedText;	
       int positionIncrement = 0;	
       //New token ?
       if (receivedText.length() == 0) {
         receivedToken = input.next();			
         if (receivedToken == null) return null;
         receivedText.append(receivedToken.termText());		
         positionIncrement = 1;		
       }
       emittedText = getNextPart();			
       //Warning : all tokens are emitted with the *same* offset
       if (emittedText.length() > 0) {
         Token emittedToken = new Token(emittedText, 
receivedToken.startOffset(), receivedToken.endOffset());			 
emittedToken.setPositionIncrement(positionIncrement);
         return emittedToken;
       }				
     }
   }
}

Not tested at all : it is a quick copy of my "WhiteSpaceFilter" (that's 
why triming is so important up there ;-) which is not the best designed 
class.

This should work for indexing. For querying, it's another matter 
especially if you want to use the queryParser.

Keep us informed.

Cheers,

-- 
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:pierrick.brihaye@culture.fr


Mime
View raw message