Subject: Is there a StrlenFilter yet?
Date: Mon, 21 Oct 2002 16:20:11 -0700
From: "Spencer, Dave"
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Message-ID: <728DA21B8941A843A7C496F1ACF485185AD374@gleam.lumos.com>

Use case - you want to protect yourself against pathological docs, such as one containing a string of a million consecutive characters. Any normal tokenizer will consider this one big token, but there's probably no point in indexing a string that is a million characters long. One example is indexing a mailing list, which could contain uuencoded attachments - there could be lots of lamo lines 72 or so chars long. Anyway - I've attached a possible impl.
Discussion question: let's say the filter is told to only return tokens <= 5 chars long (note: I think 16 or so would be more realistic for most docs - this is just for the sake of example). What if there is one 6 chars long, i.e. longer than the limit - say it is "abcdef"? Then either:

[a] we ignore "abcdef" and assume it is garbage, or
[b] we return "abcde" and "bcdef", i.e. all 5-char substrings of it, so that if someone wants to search on the 6-char string they sort of still can (at least with a carefully chosen query... hmmm).

Anyway, here's some code. If popular it could be put into StandardAnalyzer.

--------
package com.tropo.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.*;

/**
 * Removes words that are too long or too short from the stream.
 */
public final class StrlenFilter extends TokenFilter {

  final int min;
  final int max;

  /**
   * Build a filter that removes words that are too long or too short from the text.
   */
  public StrlenFilter(TokenStream in, int min, int max) {
    input = in;
    this.min = min;
    this.max = max;
  }

  /** Returns the next input Token whose termText() is the right length. */
  public final Token next() throws IOException {
    // return the first token whose length is within bounds
    for (Token token = input.next(); token != null; token = input.next()) {
      final int len = token.termText().length();
      if (len >= min && len <= max)
        return token;
      // note: else we ignore it - but should we index each part of it?
    }
    // reached EOS -- return null
    return null;
  }
}
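For the curious, option [b] above can be sketched without any Lucene dependency. This is a minimal, hypothetical helper (the class and method names are my own invention, not Lucene API) that slices an over-long term into every substring of the maximum allowed length:

```java
// Hypothetical sketch of option [b]: when a term exceeds max, emit all
// of its max-length substrings instead of dropping it. Plain Java,
// for illustration only - not part of Lucene.
import java.util.ArrayList;
import java.util.List;

public class SubstringSketch {

    /**
     * If the term fits within max, return it alone; otherwise return
     * every substring of length max, e.g. "abcdef" with max=5 yields
     * ["abcde", "bcdef"].
     */
    static List<String> slice(String term, int max) {
        List<String> out = new ArrayList<String>();
        if (term.length() <= max) {
            out.add(term);
        } else {
            // slide a window of width max across the term
            for (int i = 0; i + max <= term.length(); i++) {
                out.add(term.substring(i, i + max));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(slice("abcdef", 5)); // prints [abcde, bcdef]
    }
}
```

Note the trade-off: a pathological million-char token would expand into roughly a million overlapping substrings, so in practice you would probably still want a hard cap above which the term is simply discarded.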