From: Giulio Cesare Solaroli
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Date: Thu, 15 Jul 2004 09:28:00 +0200
Subject: Suggestion to improve the LengthFilter

Hi all,

the LengthFilter available in the Lucene sandbox drops all tokens whose length falls outside the range defined when the filter is constructed. In my opinion, dropping longer tokens is too drastic; it would be better to truncate the token and index only its first part.
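The truncation rule is simple enough to sketch in isolation, independent of the Lucene API (the class and method names below are illustrative, not from the sandbox code):

```java
// Minimal, Lucene-free sketch of the proposed truncation policy:
// a token that exceeds the maximum size is cut down to its first
// maxTokenSize characters instead of being dropped from the index.
public class TruncateDemo {
    static String truncate(String termText, int maxTokenSize) {
        if (termText.length() > maxTokenSize) {
            return termText.substring(0, maxTokenSize);
        }
        return termText;
    }

    public static void main(String[] args) {
        // With a maximum length of 12, a long word keeps its prefix
        // rather than disappearing from the index entirely.
        System.out.println(truncate("internationalization", 12)); // prints "internationa"
        System.out.println(truncate("short", 12)); // prints "short"
    }
}
```
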
Additionally, a second, much higher limit could be set above which tokens are dropped entirely (to avoid indexing file names with full paths or other odd strings that may appear inside documents). This way you can safely set a "short" maximum length (we are trying a value of 12) without fear of losing any meaningful information. If tokens exceeding the maximum length were simply dropped, a much higher value would be needed (probably around 20). Capping tokens at 12 characters should reduce the number of distinct tokens stored in the index and thus limit the "explosion" of wildcard terms used in queries.

This is our alternative implementation of LengthFilter's next() method that applies this policy:

-----------------------------
public org.apache.lucene.analysis.Token next() throws java.io.IOException {
    org.apache.lucene.analysis.Token result;

    // Skip tokens that are too short.
    do {
        result = input.next();
    } while ((result != null) && (result.termText().length() <= this.minTokenSize()));

    // Truncate tokens that are too long instead of dropping them.
    if ((result != null) && (result.termText().length() > this.maxTokenSize())) {
        logger.debug(result.termText().substring(0, this.maxTokenSize()) + " "
                + result.termText().substring(this.maxTokenSize()));
        result = new org.apache.lucene.analysis.Token(
                result.termText().substring(0, this.maxTokenSize()),
                result.startOffset(), result.endOffset(), result.type());
    }

    return result;
}
-----------------------------

Hope this will be helpful to someone.

Regards,

Giulio Cesare Solaroli

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org