lucene-java-user mailing list archives

From "Hackl, Rene" <Rene.Ha...@FIZ-Karlsruhe.DE>
Subject Ngram index
Date Tue, 30 Sep 2003 15:01:18 GMT
Just an idea about what Leo said yesterday.

> what about bi/tri-grams + some sort of hit filtering? It will do the 
> job. I just saw some ineffective implementation of 1-grams for CJK on 
> lucene-dev@. It could be a good starting point for full n-gram 
> support... Just a thought.

A change in the AliasFilter seemed to work:

private void addAliasesToStack(Token token, Stack aliasStack) {
     if (token == null) return;

     String tokenString = token.termText();
     String tokenSubString = "";

     // --- from here ---
     // Slide a 3-character window over the term and collect the
     // tri-grams as a space-separated list. Terms shorter than
     // three characters produce no tri-grams at all.
     int x = 0;
     while (tokenString.length() > x + 2) {
          tokenSubString += tokenString.substring(x, x + 3);
          tokenSubString += " ";
          x++;
     }
     // --- to here ---

     //System.out.println( "SUBSTRING ELEMENTS: "+tokenSubString );

     // Push each tri-gram onto the alias stack as a token of its own.
     StringTokenizer tokenizer = new StringTokenizer(tokenSubString, " ");
     while (tokenizer.hasMoreTokens()) {
          String nextAlias = tokenizer.nextToken();
          Token nextTokenAlias = new Token(nextAlias, 0, nextAlias.length());

          aliasStack.push(nextTokenAlias);
     }
}

This snippet creates overlapping tri-grams: "lucene", for example, becomes
"luc", "uce", "cen" and "ene". I don't know whether this is of any use; it
is just a notion.
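
For anyone who wants to try the idea without touching the filter, here is a minimal standalone sketch of the same sliding window in plain Java. The class name NGramSketch and the method ngrams are made up for illustration and are not part of Lucene or the AliasFilter. The window size is a parameter, so it should also cover the bi-gram case Leo mentioned.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
class NGramSketch {

    // Returns every overlapping substring of length n, left to right.
    // Terms shorter than n yield an empty list, matching the loop
    // condition used in the filter change.
    public static List<String> ngrams(String term, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        if (term == null) {
            return grams;
        }
        for (int start = 0; start + n <= term.length(); start++) {
            grams.add(term.substring(start, start + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3)); // prints [luc, uce, cen, ene]
    }
}
```

Collecting the grams in a list avoids the detour through a space-separated string, which would presumably matter only if a term could itself contain a space.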

Best regards,

René Hackl
