lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Token Filter question
Date Thu, 18 Aug 2005 20:58:40 GMT
On Aug 18, 2005, at 3:51 PM, Dan Armbrust wrote:
> I am implementing a filter that will remove certain characters from  
> the tokens - things like '(', etc. - but the chars to be removed will  
> be customizable.
>
> This is what I have come up with - but it doesn't seem very  
> efficient.  Is there a better way?

I haven't looked at your code closely, but here are a few things to  
note:

> Should I be adjusting the token endOffset when I remove characters?

This really depends on what you plan to do with the offsets.  If  
you're not using them at all, it doesn't matter.  But if you're doing  
hit highlighting, it does matter, because the offsets give the  
positions to highlight.  If you've got text that says "(foo)" and you  
want searches for "foo" to highlight only "foo" and not "(foo)", then  
you'll want to adjust both the start and end offsets accordingly (this  
presumes your filter is seeing "(foo)" as a single token).
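
A minimal sketch of that offset adjustment, assuming the filter only needs to account for characters stripped from the start and end of a token. TrimmedSpan and OffsetAdjuster are hypothetical names used for illustration, not part of Lucene; inside a real filter the resulting values would be passed to the Token constructor in place of t.startOffset() and t.endOffset().

```java
import java.util.Set;

// Hypothetical holder for the cleaned text and its adjusted offsets.
final class TrimmedSpan {
    final String text;
    final int startOffset;
    final int endOffset;

    TrimmedSpan(String text, int startOffset, int endOffset) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

// Hypothetical helper, not a Lucene class.
final class OffsetAdjuster {
    static TrimmedSpan strip(String term, int startOffset, int endOffset,
                             Set<Character> remove) {
        // Count removable characters at the front of the term.
        int lead = 0;
        while (lead < term.length() && remove.contains(term.charAt(lead))) {
            lead++;
        }
        // Find the index just past the last character that is kept.
        int trail = term.length();
        while (trail > lead && remove.contains(term.charAt(trail - 1))) {
            trail--;
        }
        // Drop removable characters in the middle as well; these do not
        // move the offsets, since the span still covers the same source text.
        StringBuilder kept = new StringBuilder();
        for (int i = lead; i < trail; i++) {
            char c = term.charAt(i);
            if (!remove.contains(c)) {
                kept.append(c);
            }
        }
        int newStart = startOffset + lead;
        int newEnd = endOffset - (term.length() - trail);
        return new TrimmedSpan(kept.toString(), newStart, newEnd);
    }
}
```

For the token "(foo)" at offsets 0 to 5 with '(' and ')' in the removal set, this yields "foo" at offsets 1 to 4, so a highlighter would mark only "foo". Characters removed from the middle of a token leave the offsets unchanged, so the highlighted span would still include them.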

>   If I end up removing all characters, should I be returning null,  
> rather than returning a token with no text?

If you return null, the analysis process ends, thinking that is the  
end of the token stream.  Instead, when a token ends up empty, grab  
the next token from the input and process it; keep returning  
successive tokens through your filter, and return null only once the  
input is exhausted.
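
A minimal sketch of that loop, written against stand-in types so it runs without Lucene on the classpath. SimpleTokenStream, ListTokenStream and SkippingCharRemover are hypothetical names; the only thing borrowed from the real TokenStream is the convention that next() returns null at the end of the stream. In an actual TokenFilter the same loop would go inside next(), reading from input.next().

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;
import java.util.Set;

// Stand-in for the token stream contract: next() returns null at end of stream.
interface SimpleTokenStream {
    String next();
}

// Hypothetical source stream backed by a fixed list of terms.
final class ListTokenStream implements SimpleTokenStream {
    private final Queue<String> terms;

    ListTokenStream(String... terms) {
        this.terms = new ArrayDeque<>(Arrays.asList(terms));
    }

    public String next() {
        return terms.poll(); // poll() returns null once the queue is empty
    }
}

// Hypothetical filter: strips characters and skips tokens left empty.
final class SkippingCharRemover implements SimpleTokenStream {
    private final SimpleTokenStream input;
    private final Set<Character> charsToRemove;
    private final StringBuilder buffer = new StringBuilder();

    SkippingCharRemover(SimpleTokenStream input, Set<Character> charsToRemove) {
        this.input = input;
        this.charsToRemove = charsToRemove;
    }

    public String next() {
        // Keep pulling from the input until a token survives the stripping.
        for (String term = input.next(); term != null; term = input.next()) {
            buffer.setLength(0);
            for (int i = 0; i < term.length(); i++) {
                char c = term.charAt(i);
                if (!charsToRemove.contains(c)) {
                    buffer.append(c);
                }
            }
            if (buffer.length() > 0) {
                return buffer.toString();
            }
            // Every character was removed: fall through and try the next token.
        }
        // Only reached when the input itself is exhausted.
        return null;
    }
}
```

Fed the terms "(foo)", "()" and "bar" with '(' and ')' in the removal set, this returns "foo", then "bar", then null; the all-punctuation token is skipped without ending the stream early.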

     Erik


>
>
>
> public class CharRemovingFilter extends TokenFilter
> {
>    StringBuffer temp = new StringBuffer();
>    Set          charsToRemove;
>
>    /**
>     * Builds a Set from an array of chars to remove, appropriate for
>     * passing into the CharRemovingFilter constructor.
>     */
>    public static final Set makeCharRemovalSet(char[] charsToRemove)
>    {
>        HashSet temp = new HashSet(charsToRemove.length);
>        for (int i = 0; i < charsToRemove.length; i++)
>        {
>            temp.add(new Character(charsToRemove[i]));
>        }
>        return temp;
>    }
>
>    public CharRemovingFilter(TokenStream in, Set charsToRemove)
>    {
>        super(in);
>        this.charsToRemove = charsToRemove;
>    }
>
>    public Token next() throws IOException
>    {
>        Token t = input.next();
>
>        if (t == null)
>        {
>            return null;
>        }
>
>        temp.setLength(0);
>
>        for (int i = 0; i < t.termText().length(); i++)
>        {
>            if (!charsToRemove.contains(new Character(t.termText().charAt(i))))
>            {
>                temp.append(t.termText().charAt(i));
>            }
>        }
>
>        Token returnValue = new Token(temp.toString(), t.startOffset(), t.endOffset());
>
>        return returnValue;
>    }
> }
>
> And here is part of the Analyzer that uses it:
>
>    public final TokenStream tokenStream(String fieldname, final Reader reader)
>    {
>        TokenStream result = new WhitespaceTokenizer(reader);
>        result = new LowerCaseFilter(result);
>        if (stopTable != null)
>        {
>            result = new StopFilter(result, stopTable);
>        }
>        if (charRemovalTable != null)
>        {
>            result = new CharRemovingFilter(result, charRemovalTable);
>        }
>
>        return result;
>    }
>
> Thanks,
>
> Dan
>
> -- 
> ****************************
> Daniel Armbrust
> Biomedical Informatics
> Mayo Clinic Rochester
> daniel.armbrust(at)mayo.edu
> http://informatics.mayo.edu/
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



