lucene-dev mailing list archives

From Giulio Cesare Solaroli <giulio.ces...@gmail.com>
Subject Suggestion to improve the LenghtFilter
Date Thu, 15 Jul 2004 07:28:00 GMT
Hi all,

The LengthFilter available in the Lucene sandbox drops every token
whose length falls outside the range defined when the filter is
constructed.

In my opinion, dropping longer tokens is too drastic; it would be
much better to truncate the token and index only its first part.

Optionally, another (much higher) limit could be set above which
tokens are ignored entirely (to avoid filenames with full paths or
other strange things that can be found inside documents).
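
The two-limit policy described above can be sketched on plain strings,
without any Lucene dependency. This is only an illustration: the class
name TokenLengthPolicy, the method apply, and the limits 2 and 40 are
hypothetical (only the value 12 comes from this message), and the sketch
assumes a token whose length equals the minimum is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenLengthPolicy {
    /**
     * Returns the term to index, or null if the term should be dropped.
     * Hypothetical helper: terms shorter than minLength or longer than
     * dropLength are dropped; terms longer than truncateLength are cut
     * down to their first truncateLength characters.
     */
    public static String apply(String term, int minLength, int truncateLength, int dropLength) {
        int length = term.length();
        if (length < minLength || length > dropLength) {
            return null; // too short, or so long it is probably a path or junk
        }
        if (length > truncateLength) {
            return term.substring(0, truncateLength); // keep only the first part
        }
        return term;
    }

    /** Applies the policy to a list of terms, skipping the dropped ones. */
    public static List<String> filter(List<String> terms, int minLength, int truncateLength, int dropLength) {
        List<String> kept = new ArrayList<String>();
        for (String term : terms) {
            String result = apply(term, minLength, truncateLength, dropLength);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "a",
            "lucene",
            "internationalization",
            "/home/user/projects/lucene/sandbox/contributions/analyzers/README.txt");
        // Assumed limits: minimum 2, truncate above 12, drop above 40.
        System.out.println(filter(terms, 2, 12, 40)); // prints [lucene, internationa]
    }
}
```

With these assumed limits the one-character token and the long path are
dropped, while the 20-character word is kept as its first 12 characters.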

This way you can safely set a "short" maximum length (we are trying
a value of 12) without fear of losing any meaningful information. If
the tokens exceeding the maximum length were dropped instead, a much
higher value would be needed (possibly around 20).

A maximum token length of 12 should help reduce the number of
distinct tokens stored in the index, and thus limit the "explosion"
of terms when wildcards are used in queries.
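
As a rough illustration of why truncation reduces the number of distinct
tokens, the sketch below counts distinct terms before and after cutting
them to 12 characters. The class name DistinctTermCount, the method
countDistinct, and the sample words are made up for the example; real
savings depend entirely on the documents being indexed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctTermCount {
    /** Counts the distinct terms left after truncating each one to maxLength characters. */
    public static int countDistinct(List<String> terms, int maxLength) {
        Set<String> distinct = new HashSet<String>();
        for (String term : terms) {
            // Terms shorter than maxLength are kept unchanged.
            distinct.add(term.substring(0, Math.min(term.length(), maxLength)));
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        // Made-up sample: three words sharing the same first 12 characters, plus one short word.
        List<String> terms = Arrays.asList(
            "internationalization",
            "internationalisation",
            "internationally",
            "lucene");
        System.out.println(countDistinct(terms, Integer.MAX_VALUE)); // prints 4
        System.out.println(countDistinct(terms, 12));                // prints 2
    }
}
```

The three long words collapse into the single term "internationa", so a
wildcard query has fewer index terms to expand over.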

This is our alternative implementation of the next() of the
LengthFilter to handle this policy:

-----------------------------

public org.apache.lucene.analysis.Token next() throws java.io.IOException {
	org.apache.lucene.analysis.Token	result;

	// Skip tokens that are too short (a token whose length equals minTokenSize() is skipped too).
	do {
		result = input.next();
	} while ((result != null) && (result.termText().length() <= this.minTokenSize()));

	// Truncate tokens that are too long instead of dropping them.
	if ((result != null) && (result.termText().length() > this.maxTokenSize())) {
		// Log the kept prefix and the discarded remainder.
		logger.debug(result.termText().substring(0, this.maxTokenSize()) + " " + result.termText().substring(this.maxTokenSize()));

		// The new token keeps the original offsets and type.
		result = new org.apache.lucene.analysis.Token(
			result.termText().substring(0, this.maxTokenSize()),
			result.startOffset(), result.endOffset(), result.type());
	}

	return result;
}

-----------------------------

Hope this will be helpful to someone.

Regards,

Giulio Cesare Solaroli

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

