lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: derive tokens from single token
Date Mon, 29 Sep 2003 13:50:27 GMT
On Monday, September 29, 2003, at 09:19  AM, Hackl, Rene wrote:
> I'm looking for a way to implement simultaneous left and right 
> truncation.
>
> The goal is to enable the user to search for e.g. "*hydronaphth*" and 
> find
> "hexahydronaphthalene" as well as "heptahydronaphthalin".

WildcardQuery, if used directly but not through QueryParser, allows for 
left and right wildcards like this.  Performance is the biggest concern 
though, as it will have to enumerate all terms in the index to look for 
matches.

> To achieve that functionality, I'd like to index terms in the way that 
> from
> a token "foobar" the tokens "oobar" and "obar" ( e.g. mininum word 
> length =
> 4)
> would be derived and added to the index. I tried to extend 
> TokenFilter, but
> all I get is either "oobar" or "obar", depends on when 'return' is 
> called.

This is an interesting approach and would certainly give you the search 
results you are looking for, except that you'll be indexing a ton of 
terms I'd guess.  If there is some other way to split these words by 
separating by prefix ("hexa", "hepta") and suffix ("alene", "alin") it 
would likely be better.  But maybe its not practical to do so.

But, to the main question....

> How could I add such extra tokens to the tokenStream? Any thoughts on 
> this
> appreciated.

Yes, this can be done.  Craig Walls did this with an example Analyzer 
in his December '02 Java Developer's Journal article.  I adapted his 
example to use to demonstrate this for presentations.  Here is my code:

import java.io.IOException;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.ResourceBundle;
import java.util.Stack;
import java.util.StringTokenizer;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class AliasFilter extends TokenFilter {
   private static HashMap aliasMap = new HashMap();

   static {
       aliasMap.put("country","yeehaw");
       aliasMap.put("western","rawhide");
   }

   private Stack currentTokenAliases;

   public AliasFilter(TokenStream in) {
     currentTokenAliases = new Stack();
     input = in;
   }

   public Token next() throws IOException {
     if(currentTokenAliases.size() > 0) {
       return (Token)currentTokenAliases.pop();
     }

     Token nextToken = input.next();
     addAliasesToStack(nextToken, currentTokenAliases);

     return nextToken;
   }

   private void addAliasesToStack(Token token, Stack aliasStack) {
     if(token == null) return;

     String aliasString = (String)aliasMap.get(token.termText());

     if(aliasString == null || aliasString.length() < 1) return;

     StringTokenizer tokenizer = new StringTokenizer(aliasString, " ");
     while(tokenizer.hasMoreElements()) {
       String nextAlias = tokenizer.nextToken();
       Token nextTokenAlias = new Token(nextAlias, 0, 
nextAlias.length());
       aliasStack.push(nextTokenAlias);
     }
   }
}

Note that this is a braindead example of injecting tokens into the 
stream, so when "country" or "western" is encountered, another token is 
added to the stream.  In your case you'll do something with the initial 
token and push all its 4+ length pieces onto the stack.

	Erik


Mime
View raw message