lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Armbrust <daniel.armbrust.l...@gmail.com>
Subject Token Filter question
Date Thu, 18 Aug 2005 19:51:59 GMT
I am implementing a filter that will remove certain characters from the 
tokens - thing like '(', etc - but the chars to be removed will be 
customizable.

This is what I have come up with - but it doesn't seem very efficient.  
Is there a better way?
Should I be adjusting the token endOffset when I remove characters?  If 
I end up removing all characters, should I be returning null, rather 
than returning a token with no text?



public class CharRemovingFilter extends TokenFilter
{
    StringBuffer temp = new StringBuffer();
    Set          charsToRemove;

    /**
     * Builds a Set from an array of chars to remove, appropriate for 
passing into the
     * CharRemovingFilter constructor.
     */
    public static final Set makeCharRemovalSet(char[] charsToRemove)
    {
        HashSet temp = new HashSet(charsToRemove.length);
        for (int i = 0; i < charsToRemove.length; i++)
        {
            temp.add(new Character(charsToRemove[i]));
        }
        return temp;
    }

    public CharRemovingFilter(TokenStream in, Set charsToRemove)
    {
        super(in);
        this.charsToRemove = charsToRemove;
    }

    public Token next() throws IOException
    {
        Token t = input.next();

        if (t == null)
        {
            return null;
        }

        temp.setLength(0);

        for (int i = 0; i < t.termText().length(); i++)
        {
            if (!charsToRemove.contains(new 
Character(t.termText().charAt(i))))
            {
                temp.append(t.termText().charAt(i));
            }
        }

        Token returnValue = new Token(temp.toString(), t.startOffset(), 
t.endOffset());

        return returnValue;
    }


And here is part of the Analyzer that uses it:

    public final TokenStream tokenStream(String fieldname, final Reader 
reader)
    {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new LowerCaseFilter(result);
        if (stopTable != null)
        {
            result = new StopFilter(result, stopTable);
        }
        if (charRemovalTable != null)
        {
            result = new CharRemovingFilter(result, charRemovalTable);
        }

        return result;
    }

Thanks,

Dan

-- 
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message