lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: StrlenFilter contribution and discussion
Date Tue, 02 Mar 2004 12:48:38 GMT
I remember you sending this once before....a long time ago.
This time I'll stick it in sandbox/contributions/analyzers/...
Thanks!

Otis

--- David Spencer <dave-lucene-user@tropo.com> wrote:
> 
> Out of curiosity - does anyone use a Filter based on string (token) 
> length. Use case is, say, you're indexing email msgs and if an 
> attachment is uuencoded into lines of 60 or whatever characters then
> you 
>   don't want to index tokens that are so long as they can't possibly
> be 
> of use later and just eat up disk space.
> 
> Plz feel free to add this to sandbox with whatever license is
> appropriate.
> 
> The code is easy:
> 
> /**
>   * Removes words that are too long and too short from the stream
>   */
> public final class StrlenFilter
> 	extends TokenFilter
> {
> 	/**
> 	 * Build a filter that removes words that are too long or too short 
> from the text.
> 	 */
> 	public StrlenFilter(TokenStream in, int min, int max)
> 	{
> 		input = in;
> 		this.min = min;
> 		this.max =max;
> 	}
> 
> 	/** Returns the next input Token whose termText() is the right len
> 	 */
> 	public final Token next() throws IOException
> 	{
> 		// return the first non-stop word found
> 		for (Token token = input.next(); token != null; token =
> input.next())
> 		{
> 			final int len = token.termText().length();
> 			if ( len >= min && len <= max)
> 				return token;
> 			// note: else we ignore it but should we index each part of it?
> 		}
> 		// reached EOS -- return null		
> 		return null;
> 	}
> 	final int min;
> 	final int max;
> }
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message