lucene-java-user mailing list archives

From: ilma barbosa <ilmabarb...@yahoo.com.br>
Subject: Re: derive tokens from single token
Date: Mon, 29 Sep 2003 17:53:10 GMT
--- Erik Hatcher <erik@ehatchersolutions.com> wrote:
> On Monday, September 29, 2003, at 09:19 AM, Hackl, Rene wrote:
> > I'm looking for a way to implement simultaneous left and right
> > truncation.
> >
> > The goal is to enable the user to search for e.g. "*hydronaphth*" and
> > find "hexahydronaphthalene" as well as "heptahydronaphthalin".
>
> WildcardQuery, if used directly rather than through QueryParser, allows
> left and right wildcards like this. Performance is the biggest concern,
> though, as it will have to enumerate all terms in the index to look for
> matches.
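
For reference, "used directly" here means building the query in code rather
than handing the wildcard string to QueryParser, which (as noted above) does
not accept this form. A minimal sketch against the Lucene API of that era;
the field name "contents" and the index path are placeholders, not anything
from this thread:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class LeadingWildcardSearch {
    public static void main(String[] args) throws Exception {
        // Build the query programmatically instead of going through QueryParser.
        Query query = new WildcardQuery(new Term("contents", "*hydronaphth*"));

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("contents"));
        }
        searcher.close();
    }
}

Because the pattern starts with a wildcard, the query still has to walk every
term in the index, which is the performance cost Erik mentions.
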
>
> > To achieve that functionality, I'd like to index terms in such a way
> > that from a token "foobar" the tokens "oobar" and "obar" (e.g. minimum
> > word length = 4) would be derived and added to the index. I tried to
> > extend TokenFilter, but all I get is either "oobar" or "obar",
> > depending on when 'return' is called.
>
> This is an interesting approach and would certainly give you the search
> results you are looking for, except that you'll be indexing a ton of
> terms, I'd guess. If there is some other way to split these words by
> prefix ("hexa", "hepta") and suffix ("alene", "alin"), it would likely
> be better. But maybe it's not practical to do so.
>
> But, to the main question...
>
> > How could I add such extra tokens to the tokenStream? Any thoughts on
> > this are appreciated.
>
> Yes, this can be done. Craig Walls did this with an example Analyzer in
> his December '02 Java Developer's Journal article. I adapted his example
> to demonstrate the technique in presentations. Here is my code:
>
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Stack;
> import java.util.StringTokenizer;
>
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
>
> public class AliasFilter extends TokenFilter {
>   // Hard-coded alias table; a value may hold several space-separated aliases.
>   private static HashMap aliasMap = new HashMap();
>
>   static {
>     aliasMap.put("country", "yeehaw");
>     aliasMap.put("western", "rawhide");
>   }
>
>   private Stack currentTokenAliases;
>
>   public AliasFilter(TokenStream in) {
>     currentTokenAliases = new Stack();
>     input = in;
>   }
>
>   public Token next() throws IOException {
>     // Emit any aliases queued for the previously returned token first.
>     if (currentTokenAliases.size() > 0) {
>       return (Token) currentTokenAliases.pop();
>     }
>
>     Token nextToken = input.next();
>     addAliasesToStack(nextToken, currentTokenAliases);
>
>     return nextToken;
>   }
>
>   private void addAliasesToStack(Token token, Stack aliasStack) {
>     if (token == null) return;
>
>     String aliasString = (String) aliasMap.get(token.termText());
>
>     if (aliasString == null || aliasString.length() < 1) return;
>
>     // Queue one extra token per alias.
>     StringTokenizer tokenizer = new StringTokenizer(aliasString, " ");
>     while (tokenizer.hasMoreElements()) {
>       String nextAlias = tokenizer.nextToken();
>       Token nextTokenAlias = new Token(nextAlias, 0, nextAlias.length());
>       aliasStack.push(nextTokenAlias);
>     }
>   }
> }
> 
> Note that this is a braindead example of injecting tokens into the
> stream, so when "country" or "western" is encountered, another token is
> added to the stream. In your case you'll do something with the initial
> token and push all its 4+ length pieces onto the stack.
>
> 	Erik
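
Following Erik's suggestion, here is a rough, untested adaptation of the
AliasFilter above that pushes every suffix of length 4 or more onto the
stack. It targets the same pre-1.4 Token/TokenFilter API as the code above;
the class name SuffixFilter and the MIN_LENGTH constant are placeholders of
mine, not anything from the article:

import java.io.IOException;
import java.util.Stack;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SuffixFilter extends TokenFilter {
    // Shortest suffix worth indexing, per the "minimum word length = 4" example.
    private static final int MIN_LENGTH = 4;

    private Stack suffixes = new Stack();

    public SuffixFilter(TokenStream in) {
        input = in;
    }

    public Token next() throws IOException {
        // Emit the suffixes queued for the previously returned token first.
        if (suffixes.size() > 0) {
            return (Token) suffixes.pop();
        }

        Token token = input.next();
        if (token == null) return null;

        String text = token.termText();
        // Queue every suffix that still meets the minimum length,
        // e.g. "foobar" also yields "oobar" and "obar".
        for (int start = 1; text.length() - start >= MIN_LENGTH; start++) {
            String suffix = text.substring(start);
            // Reuse the original token's offsets; phrase positions are ignored here.
            suffixes.push(new Token(suffix, token.startOffset(), token.endOffset()));
        }

        return token;
    }
}

With those extra terms in the index, a search for *hydronaphth* could then be
run as the prefix query hydronaphth*, since every word containing that
substring has an indexed suffix starting with it. That avoids enumerating the
whole term dictionary, at the price of a much larger index, which is the
trade-off Erik points out.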


