From: Giulio Cesare Solaroli
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Date: Thu, 15 Jul 2004 09:28:00 +0200
Subject: Suggestion to improve the LengthFilter

Hi all,

the LengthFilter available in the Lucene sandbox drops all tokens whose length falls outside the range defined when the filter is constructed. In my opinion, dropping longer tokens is too drastic; it would be better to truncate the token and index only its first part.
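The truncation rule is simple enough to sketch in isolation, independent of the Lucene API (the class and method names below are illustrative, not from the sandbox code):

```java
// Minimal, Lucene-free sketch of the proposed truncation policy:
// a token that exceeds the maximum size is cut down to its first
// maxTokenSize characters instead of being dropped from the index.
public class TruncateDemo {
    static String truncate(String termText, int maxTokenSize) {
        if (termText.length() > maxTokenSize) {
            return termText.substring(0, maxTokenSize);
        }
        return termText;
    }

    public static void main(String[] args) {
        // With a maximum length of 12, a long word keeps its prefix
        // rather than disappearing from the index entirely.
        System.out.println(truncate("internationalization", 12)); // prints "internationa"
        System.out.println(truncate("short", 12)); // prints "short"
    }
}
```
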
Additionally, a second, much higher limit could be set above which tokens are dropped entirely (to avoid indexing file names with full paths or other odd strings that may appear inside documents). This way you can safely set a "short" maximum length (we are trying a value of 12) without fear of losing any meaningful information. If tokens exceeding the maximum length were simply dropped, a much higher value would be needed (probably around 20). Capping tokens at 12 characters should reduce the number of distinct tokens stored in the index and thus limit the "explosion" of wildcard terms used in queries.

This is our alternative implementation of LengthFilter's next() method that applies this policy:

-----------------------------
public org.apache.lucene.analysis.Token next() throws java.io.IOException {
    org.apache.lucene.analysis.Token result;

    // Skip tokens that are too short.
    do {
        result = input.next();
    } while ((result != null) && (result.termText().length() <= this.minTokenSize()));

    // Truncate tokens that are too long instead of dropping them.
    if ((result != null) && (result.termText().length() > this.maxTokenSize())) {
        logger.debug(result.termText().substring(0, this.maxTokenSize()) + " "
                + result.termText().substring(this.maxTokenSize()));
        result = new org.apache.lucene.analysis.Token(
                result.termText().substring(0, this.maxTokenSize()),
                result.startOffset(), result.endOffset(), result.type());
    }

    return result;
}
-----------------------------

Hope this will be helpful to someone.

Regards,

Giulio Cesare Solaroli

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org