From: John Byrne
Date: Mon, 01 Oct 2007 14:54:36 +0100
To: java-user@lucene.apache.org
Subject: Re: Indexing puncuation and symbols

WhitespaceAnalyzer does preserve those symbols, but not as tokens. It simply leaves them attached to the original term.

As an example of what I'm talking about, consider a document that contains (without the quotes) "foo, ". Using WhitespaceAnalyzer, I could only find that document by searching for "foo,". Using StandardAnalyzer, or any analyzer that removes punctuation, I could only find it by searching for "foo".

I want an analyzer that will allow me to find it if I build a phrase query with the term "foo" followed immediately by ",". After all, the comma may be relevant to the search, but it is definitely not part of the word.

Extending StandardAnalyzer is what I had in mind, but I don't know where to start. I also wonder why no one seems to have done it before; it makes me suspect that there's some reason I haven't seen yet that makes it impossible or impractical.

Karl Wettin wrote:
>
> On 1 Oct 2007, at 15:33, John Byrne wrote:
>
>> Has anyone written an analyzer that preserves punctuation and
>> symbols ("�", "$", "%" etc.) as tokens?
>
> WhitespaceAnalyzer?
>
> You could also extend the lexical rules of StandardAnalyzer.
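
To make the behaviour asked for in the reply concrete, here is a minimal sketch in plain Java of the token stream such an analyzer would need to produce: runs of letters and digits become one token each, and every punctuation mark or symbol becomes its own token. This is not Lucene code and does not use any Lucene API; the class and method names (PunctuationTokens, tokenize, containsPhrase) are hypothetical, and the regular expression is only one possible rule. In an actual index the same splitting logic would presumably have to live in a custom tokenizer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: plain Java, no Lucene classes involved.
class PunctuationTokens {

    // Either a run of letters/digits, or a single character that is
    // neither whitespace nor a letter/digit (punctuation or symbol).
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Splits text into word tokens and single-character symbol tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Crude stand-in for a phrase query: true if the phrase's tokens
    // occur in the document's tokens consecutively and in order.
    static boolean containsPhrase(String document, String phrase) {
        return Collections.indexOfSubList(tokenize(document), tokenize(phrase)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo, bar"));
        System.out.println(containsPhrase("foo, bar", "foo,"));
        System.out.println(containsPhrase("foo bar", "foo,"));
    }
}
```

With this splitting, a document containing "foo, " yields the tokens "foo" and ",", so it matches both a search for "foo" alone and a phrase of "foo" followed immediately by ",", while a document containing only "foo bar" does not match the phrase.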