lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <dave-lucene-u...@tropo.com>
Subject StrlenFilter contribution and discussion
Date Mon, 01 Mar 2004 20:47:11 GMT

Out of curiosity - does anyone use a Filter based on string (token) 
length. Use case is, say, you're indexing email msgs and if an 
attachment is uuencoded into lines of 60 or whatever characters then you 
  don't want to index tokens that are so long as they can't possibly be 
of use later and just eat up disk space.

Plz feel free to add this to sandbox with whatever license is appropriate.

The code is easy:

/**
  * Removes words that are too long and too short from the stream
  */
public final class StrlenFilter
	extends TokenFilter
{
	/**
	 * Build a filter that removes words that are too long or too short 
from the text.
	 */
	public StrlenFilter(TokenStream in, int min, int max)
	{
		input = in;
		this.min = min;
		this.max =max;
	}

	/** Returns the next input Token whose termText() is the right len
	 */
	public final Token next() throws IOException
	{
		// return the first non-stop word found
		for (Token token = input.next(); token != null; token = input.next())
		{
			final int len = token.termText().length();
			if ( len >= min && len <= max)
				return token;
			// note: else we ignore it but should we index each part of it?
		}
		// reached EOS -- return null		
		return null;
	}
	final int min;
	final int max;
}

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message