lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Token Filter question
Date Fri, 19 Aug 2005 07:13:41 GMT
On Thursday 18 August 2005 21:51, Dan Armbrust wrote:
> I am implementing a filter that will remove certain characters from the 
> tokens - thing like '(', etc - but the chars to be removed will be 
> customizable.
> 
> This is what I have come up with - but it doesn't seem very efficient.  
> Is there a better way?
> Should I be adjusting the token endOffset when I remove characters?  If 
> I end up removing all characters, should I be returning null, rather 
> than returning a token with no text?
> 
> 
> 
> public class CharRemovingFilter extends TokenFilter
> {
>     StringBuffer temp = new StringBuffer();
>     Set          charsToRemove;
> 
>     /**
>      * Builds a Set from an array of chars to remove, appropriate for 
> passing into the
>      * CharRemovingFilter constructor.
>      */
>     public static final Set makeCharRemovalSet(char[] charsToRemove)
>     {
>         HashSet temp = new HashSet(charsToRemove.length);
>         for (int i = 0; i < charsToRemove.length; i++)
>         {
>             temp.add(new Character(charsToRemove[i]));
>         }
>         return temp;
>     }
> 
>     public CharRemovingFilter(TokenStream in, Set charsToRemove)
>     {
>         super(in);
>         this.charsToRemove = charsToRemove;
>     }
> 
>     public Token next() throws IOException
>     {
>         Token t = input.next();
> 
>         if (t == null)
>         {
>             return null;
>         }
> 
>         temp.setLength(0);
> 
>         for (int i = 0; i < t.termText().length(); i++)
>         {
>             if (!charsToRemove.contains(new 
> Character(t.termText().charAt(i))))

I'd expect String.indexOf( .... .charAt(i)) to be faster than
creating a new Character.

For this for loop, using String.replace() of java 1.5 might be faster
than hashing. Removing the replaced characters would need
one more scan.
What is the maximum int value of a Unicode character anyway?

Other than that, when the int value of the input characters
is small enough, one can create a table mapping
all chars to their replacements and index this table.

Removing the replaced chars can then be done by not
incrementing the target index for a replacement character
indicating deletion.

Inlining the t.termText() method call will probably help in any case.


Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message