lucene-dev mailing list archives

From "Spencer, Dave" <d...@lumos.com>
Subject Is there a StrlenFilter yet?
Date Mon, 21 Oct 2002 23:20:11 GMT

Use case: you want to protect yourself against pathological docs,
such as one containing a string of a million consecutive characters.
Any normal tokenizer will treat this as one big token, but there's
probably no point in indexing a token that is a million characters long.
One example is indexing a mailing list which could contain uuencoded
attachments: there could be lots of junk lines 72 or so chars long.

Anyway - I've attached a possible impl.

The discussion question: let's say the filter is told to return only
tokens <= 5 chars long (note: I think 16 or so would be more realistic
for most docs; this is just for the sake of example).

What if there is a token 6 chars long, i.e. longer than the limit,
say "abcdef"?

Then either:

[a] we ignore "abcdef" and assume it is garbage
or
[b] we return "abcde" and "bcdef", i.e. all 5-char substrings
of it, so that if someone wants to search on the 6-char string they
sort of still can (at least w/ a carefully chosen query...hmmm..).
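
To make option [b] concrete, here is a minimal sketch in plain Java of
how the substrings could be generated. It is not part of the filter
below and uses no Lucene classes; the names MaxLengthWindows and
windows are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class: illustrates option [b] only.
final class MaxLengthWindows {

    /**
     * Returns the term itself if it fits within max chars; otherwise
     * returns every substring of exactly max chars, left to right.
     */
    static List<String> windows(String term, int max) {
        if (max <= 0) {
            throw new IllegalArgumentException("max must be positive");
        }
        List<String> out = new ArrayList<>();
        if (term.length() <= max) {
            out.add(term);
            return out;
        }
        // Slide a window of width max across the term, one char at a time.
        for (int start = 0; start + max <= term.length(); start++) {
            out.add(term.substring(start, start + max));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(windows("abcdef", 5)); // prints [abcde, bcdef]
    }
}
```

Note that a term of length n yields n - max + 1 substrings, so for the
million-character case option [b] would emit about a million tokens
unless it is combined with some upper cap.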

Anyway here's some code.
If popular it could be put into StandardAnalyzer.

--------
package com.tropo.lucene;

import java.io.IOException;
import org.apache.lucene.analysis.*;

/**
 * Removes tokens that are too short or too long from the stream.
 */
public final class StrlenFilter
	extends TokenFilter
{
	/**
	 * Build a filter that removes tokens that are too short or too
	 * long from the text.
	 */
	public StrlenFilter(TokenStream in, int min, int max)
	{
		input = in;
		this.min = min;
		this.max = max;
	}

	/**
	 * Returns the next input Token whose termText() length is
	 * between min and max, inclusive.
	 */
	public final Token next() throws IOException
	{
		// return the first token whose length is within bounds
		for (Token token = input.next(); token != null; token = input.next())
		{
			final int len = token.termText().length();
			if (len >= min && len <= max)
				return token;
			// note: else we ignore it, but should we index each part of it?
		}
		// reached EOS -- return null
		return null;
	}

	// inclusive bounds on the token length
	final int min;
	final int max;
}
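
The inclusive bounds check can be tried without Lucene on the
classpath. The sketch below applies the same min/max test to plain
strings; LengthFilterDemo and keepByLength are made-up names for
illustration, not part of Lucene or of the filter above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the filter loop, operating on plain strings.
final class LengthFilterDemo {

    /** Keeps only the terms whose length is between min and max, inclusive. */
    static List<String> keepByLength(List<String> terms, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            int len = term.length();
            if (len >= min && len <= max) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms =
            Arrays.asList("a", "lucene", "abcdefghijklmnopqrstuvwxyz");
        // With min=2 and max=16, only "lucene" is within bounds.
        System.out.println(keepByLength(terms, 2, 16)); // prints [lucene]
    }
}
```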


