From: Chris D
To: java-user@lucene.apache.org
Date: Tue, 28 Jun 2005 14:41:14 -0400
Subject: Re: Indexing puncutation
Message-ID: <34cc3b0a05062811415ea18747@mail.gmail.com>
In-Reply-To: <14FBF41EF1411B45B2EC4ADEAC53D1310342B688@MAIL01.wescodist.com>

On 6/28/05, Aigner, Thomas wrote:
> Hello all,
>
> I am VERY new to Lucene and we are trying out Lucene to see if
> it will accomplish the vast majority of our search functions.
>
> I have a question about a good way to index some of our product
> description codes. We have description codes like 21-MA-GAB and other
> punctuation. Our users need to be able to search for "21 MA GAB" or
> "21-MA_GAB" or "21MAGAB". Is the best way to accomplish this by
> creating synonyms for the 3 different ways when punctuation is in parts
> to search for? I know I can stop punctuation in the index but what about
> grouping the information together or with spaces?
>
> Thanks all in advance,
> Tom

There are a couple of ways to do this, and I'm not sure which would be
best. (I'm also fairly new to Lucene.)

You can create a grammar that recognizes your product codes (see the
StandardAnalyzer code for examples of how to do that), then use a custom
filter to normalize everything. Forgive my poor lex, but the general idea is:

    | ("-"|"_"|""|" ") + ("-"|"_"|""|" ") + >

Then, in the filter, normalize each code token by stripping out all of the
punctuation. This can be done with a regex or with something faster; the
regex version is shown just for reference:

    if (type == CODE_TYPE) {
        // Strip "-", "_" and spaces so every spelling of a code
        // is indexed as the same term.
        return new org.apache.lucene.analysis.Token(
            text.replaceAll("[-_ ]", ""),
            t.startOffset(), t.endOffset(), type);
    }
    ...

See StandardAnalyzer: it has a lot of code that does what you need, which
you can copy, paste, and edit.

You could also use synonyms, but that seems like it would be more overhead.

If you think of a better way, let me know; I have to do something similar.

Cheers,
Chris
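
For anyone who wants to try the normalization step outside of Lucene, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: the class name CodeNormalizer, the methods isCode and normalize, and above all the pattern for what counts as a product code, which is guessed from the single example 21-MA-GAB (digits, then two groups of letters, each optionally separated by "-", "_" or a space). It is not Lucene API and not the grammar from the message above. The point is only that all the spellings from the question collapse to one term, provided the same normalization is applied both to the indexed text and to the query.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene): collapses product-code spellings to one term. */
final class CodeNormalizer {

    // ASSUMPTION: a product code is digits followed by two groups of letters,
    // each optionally separated by "-", "_" or a space (e.g. 21-MA-GAB, 21MAGAB).
    private static final Pattern CODE =
            Pattern.compile("\\d+[-_ ]?[A-Za-z]+[-_ ]?[A-Za-z]+");

    // The separator characters removed from a recognized code.
    private static final Pattern SEPARATORS = Pattern.compile("[-_ ]");

    private CodeNormalizer() {
    }

    /** Returns true if the whole text matches the assumed product-code shape. */
    static boolean isCode(String text) {
        return CODE.matcher(text).matches();
    }

    /** Strips separators from product codes; any other text is returned unchanged. */
    static String normalize(String text) {
        if (!isCode(text)) {
            return text;
        }
        return SEPARATORS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {"21-MA-GAB", "21 MA GAB", "21-MA_GAB", "21MAGAB", "hello-world"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + normalize(sample));
        }
    }
}
```

In a real analyzer the isCode check would be replaced by the token type assigned by the grammar, and normalize would be called from the custom filter.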