Subject: Is there a StrlenFilter yet?
Date: Mon, 21 Oct 2002 16:20:11 -0700
From: "Spencer, Dave"
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Message-ID: <728DA21B8941A843A7C496F1ACF485185AD374@gleam.lumos.com>

Use case - you want to protect yourself against pathological docs, such as one containing a string of a million consecutive characters. Any normal tokenizer will consider this one big token, but there's probably no point in indexing a string that is a million characters long. One example is indexing a mailing list, which could contain uuencoded attachments - there could be lots of lamo lines 72 or so chars long. Anyway - I've attached a possible impl.
Discussion question: let's say the filter is told to only return tokens <= 5 chars long (note: I think 16 or so would be more realistic for most docs - this is just for the sake of example). What if there is one 6 chars long, i.e. longer than the limit - say it is "abcdef"? Then either:

[a] we ignore "abcdef" and assume it is garbage, or
[b] we return "abcde" and "bcdef", i.e. all 5-char substrings of it, so that if someone wants to search on the 6-char string they sort of still can (at least with a carefully chosen query... hmmm).

Anyway, here's some code. If popular it could be put into StandardAnalyzer.

--------
package com.tropo.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.*;

/**
 * Removes words that are too long or too short from the stream.
 */
public final class StrlenFilter extends TokenFilter {

  final int min;
  final int max;

  /**
   * Build a filter that removes words that are too long or too short from the text.
   */
  public StrlenFilter(TokenStream in, int min, int max) {
    input = in;
    this.min = min;
    this.max = max;
  }

  /** Returns the next input Token whose termText() is the right length. */
  public final Token next() throws IOException {
    // return the first token whose length is within bounds
    for (Token token = input.next(); token != null; token = input.next()) {
      final int len = token.termText().length();
      if (len >= min && len <= max)
        return token;
      // note: else we ignore it - but should we index each part of it?
    }
    // reached EOS -- return null
    return null;
  }
}
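For the curious, option [b] above can be sketched without any Lucene dependency. This is a minimal, hypothetical helper (the class and method names are my own invention, not Lucene API) that slices an over-long term into every substring of the maximum allowed length:

```java
// Hypothetical sketch of option [b]: when a term exceeds max, emit all
// of its max-length substrings instead of dropping it. Plain Java,
// for illustration only - not part of Lucene.
import java.util.ArrayList;
import java.util.List;

public class SubstringSketch {

    /**
     * If the term fits within max, return it alone; otherwise return
     * every substring of length max, e.g. "abcdef" with max=5 yields
     * ["abcde", "bcdef"].
     */
    static List<String> slice(String term, int max) {
        List<String> out = new ArrayList<String>();
        if (term.length() <= max) {
            out.add(term);
        } else {
            // slide a window of width max across the term
            for (int i = 0; i + max <= term.length(); i++) {
                out.add(term.substring(i, i + max));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(slice("abcdef", 5)); // prints [abcde, bcdef]
    }
}
```

Note the trade-off: a pathological million-char token would expand into roughly a million overlapping substrings, so in practice you would probably still want a hard cap above which the term is simply discarded.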