lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ilma barbosa <ilmabarb...@yahoo.com.br>
Subject Re: derive tokens from single token
Date Mon, 29 Sep 2003 17:55:46 GMT
 --- ilma barbosa <ilmabarbosa@yahoo.com.br> escreveu:
>  --- Erik Hatcher <erik@ehatchersolutions.com>
> escreveu: > On Monday, September 29, 2003, at 09:19 
> AM, Hackl,
> > Rene wrote:
> > > I'm looking for a way to implement simultaneous
> > left and right 
> > > truncation.
> > >
> > > The goal is to enable the user to search for
> e.g.
> > "*hydronaphth*" and 
> > > find
> > > "hexahydronaphthalene" as well as
> > "heptahydronaphthalin".
> > 
> > WildcardQuery, if used directly but not through
> > QueryParser, allows for 
> > left and right wildcards like this.  Performance
> is
> > the biggest concern 
> > though, as it will have to enumerate all terms in
> > the index to look for 
> > matches.
> > 
> > > To achieve that functionality, I'd like to index
> > terms in the way that 
> > > from
> > > a token "foobar" the tokens "oobar" and "obar" (
> > e.g. mininum word 
> > > length =
> > > 4)
> > > would be derived and added to the index. I tried
> > to extend 
> > > TokenFilter, but
> > > all I get is either "oobar" or "obar", depends
> on
> > when 'return' is 
> > > called.
> > 
> > This is an interesting approach and would
> certainly
> > give you the search 
> > results you are looking for, except that you'll be
> > indexing a ton of 
> > terms I'd guess.  If there is some other way to
> > split these words by 
> > separating by prefix ("hexa", "hepta") and suffix
> > ("alene", "alin") it 
> > would likely be better.  But maybe its not
> practical
> > to do so.
> > 
> > But, to the main question....
> > 
> > > How could I add such extra tokens to the
> > tokenStream? Any thoughts on 
> > > this
> > > appreciated.
> > 
> > Yes, this can be done.  Craig Walls did this with
> an
> > example Analyzer 
> > in his December '02 Java Developer's Journal
> > article.  I adapted his 
> > example to use to demonstrate this for
> > presentations.  Here is my code:
> > 
> > import java.io.IOException;
> > import java.util.Enumeration;
> > import java.util.HashMap;
> > import java.util.ResourceBundle;
> > import java.util.Stack;
> > import java.util.StringTokenizer;
> > 
> > import org.apache.lucene.analysis.Token;
> > import org.apache.lucene.analysis.TokenFilter;
> > import org.apache.lucene.analysis.TokenStream;
> > 
> > public class AliasFilter extends TokenFilter {
> >    private static HashMap aliasMap = new
> HashMap();
> > 
> >    static {
> >        aliasMap.put("country","yeehaw");
> >        aliasMap.put("western","rawhide");
> >    }
> > 
> >    private Stack currentTokenAliases;
> > 
> >    public AliasFilter(TokenStream in) {
> >      currentTokenAliases = new Stack();
> >      input = in;
> >    }
> > 
> >    public Token next() throws IOException {
> >      if(currentTokenAliases.size() > 0) {
> >        return (Token)currentTokenAliases.pop();
> >      }
> > 
> >      Token nextToken = input.next();
> >      addAliasesToStack(nextToken,
> > currentTokenAliases);
> > 
> >      return nextToken;
> >    }
> > 
> >    private void addAliasesToStack(Token token,
> Stack
> > aliasStack) {
> >      if(token == null) return;
> > 
> >      String aliasString =
> > (String)aliasMap.get(token.termText());
> > 
> >      if(aliasString == null ||
> aliasString.length()
> > < 1) return;
> > 
> >      StringTokenizer tokenizer = new
> > StringTokenizer(aliasString, " ");
> >      while(tokenizer.hasMoreElements()) {
> >        String nextAlias = tokenizer.nextToken();
> >        Token nextTokenAlias = new Token(nextAlias,
> > 0, 
> > nextAlias.length());
> >        aliasStack.push(nextTokenAlias);
> >      }
> >    }
> > }
> > 
> > Note that this is a braindead example of injecting
> > tokens into the 
> > stream, so when "country" or "western" is
> > encountered, another token is 
> > added to the stream.  In your case you'll do
> > something with the initial 
> > token and push all its 4+ length pieces onto the
> > stack.
> > 
> > 	Erik
> > 
> > 
> >
>
---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail:
> > lucene-user-help@jakarta.apache.org
> >  
> 
> 
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
>  

=====




Mime
View raw message