From: Chris D
To: java-user@lucene.apache.org
Date: Tue, 28 Jun 2005 15:37:20 -0400
Subject: Re: Indexing puncutation

On 6/28/05, Aigner, Thomas wrote:
> Thanks for the info Chris.
>
> I thought I'd provide some more information. One problem is that the
> descriptions are not easily formatted. In other words, the description
> doesn't follow a certain set of rules (num num - alpha alpha etc). They
> are literally anything a supplier has put in for them.
>
> The example below (21-MA-GAB) is stored differently by these analyzers:
>
> WhitespaceAnalyzer: [21-MA-GAB]
> SimpleAnalyzer: [ma] [gab]
> StopAnalyzer: [ma] [gab]
> StandardAnalyzer: [21-ma] [gab]
> SynonymAnalyzer: [21-ma] [gab]
> (One I created for synonyms.. much like the standard one)
> SnowballAnalyzer: [21-ma] [gab]
>
> My problem is searching for 21magab returns nothing as well as 21ma*
> etc..
>
> This is just one of my punctuation problems.. there can be "" for inches
> and 1/2 items etc..
>
> I am currently using my SynonymAnalyzer for some aliases to build the
> index and the SnowballAnalyzer to query the index (nice stemming in it)
>
> Tom

You can write an analyzer that does your tokenizing so that 21-MA-GAB
ends up stored as [21magab] in the index, assuming the codes are mostly
formatted the same way. That's what I was suggesting: writing your own
analyzer, not switching to a different existing one.
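As a rough illustration of that idea, here is a minimal sketch in plain Java with no Lucene classes involved. The class name CodeNormalizer, its two methods, and the sample description "Valve 21-MA-GAB brass" are all made up for this example; in a real index the same logic would live inside a custom analyzer. It splits on whitespace only, then drops every character that is not a letter or digit and lowercases the rest.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows the normalization only.
final class CodeNormalizer {

    // Keep letters and digits, lowercase them, drop everything else.
    public static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Split on whitespace, normalize each piece, skip pieces that end up empty.
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String term = normalize(raw);
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Made-up description containing the code from the thread.
        System.out.println(tokenize("Valve 21-MA-GAB brass")); // [valve, 21magab, brass]
    }
}
```

If the same function is applied to the query text, a search for 21magab produces the identical term, and a prefix such as 21ma lines up with the stored [21magab]. Note that this blunt stripping would also turn 1/2 into 12, so fractions and inch marks need separate handling.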
(If they're not the same, it becomes more difficult.)

The other problems you're describing with the descriptions could also be
solved with a proper analyzer. Add a "FRACTION" type to the lexical
grammar, and don't strip punctuation like the quote. Or map "1/2" to
"half" as a synonym, I guess (I haven't done much work with synonyms).

Lastly, and someone should correct me if I'm wrong, but you should
always use the same analyzer to create and to query the index. Otherwise
queries that should return hits won't. For instance, the following:

The canoeist paddles

could be indexed as [boater] [strokes]... And the query

contents:paddles

would be parsed to [paddle] and likely would not get the hit you expect.

Cheers,

Chris