Date: Mon, 1 Oct 2007 10:03:15 -0400
From: "Patrick Turcotte"
To: java-user@lucene.apache.org
Subject: Re: Indexing punctuation and symbols

Hi,

I don't know the size of your dataset, but couldn't you index the text in
two fields with PerFieldAnalyzerWrapper, tokenizing one field with
StandardAnalyzer and the other with WhitespaceAnalyzer? Then use a
multiple-field query (there is a query parser for that; MultiFieldQueryParser,
if I remember correctly).

Patrick

On 10/1/07, John Byrne wrote:
> Whitespace analyzer does preserve those symbols, but not as tokens. It
> simply leaves them attached to the original term.
>
> As an example of what I'm talking about, consider a document that
> contains (without the quotes) "foo, ".
>
> Now, using WhitespaceAnalyzer, I could only get that document by
> searching for "foo,". Using StandardAnalyzer or any analyzer that
> removes punctuation, I could only find it by searching for "foo".
>
> I want an analyzer that will allow me to find it if I build a phrase
> query with the term "foo" followed immediately by ",".
> After all, the comma may be relevant to the search, but is definitely
> not part of the word.
>
> Extending StandardAnalyzer is what I had in mind, but I don't know where
> to start. I also wonder why no one seems to have done it before; it
> makes me suspect that there's some reason I haven't seen yet that makes
> it impossible or impractical.
>
> Karl Wettin wrote:
> >
> > On 1 Oct 2007, at 15:33, John Byrne wrote:
> >
> >> Has anyone written an analyzer that preserves punctuation and
> >> symbols ("£", "$", "%" etc.) as tokens?
> >
> > WhitespaceAnalyzer?
> >
> > You could also extend the lexical rules of StandardAnalyzer.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
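
To make the requested behaviour concrete, here is a minimal sketch in plain Java of the tokenization John describes: words become tokens, and each punctuation mark or symbol becomes its own token rather than staying attached to the word or being dropped. It does not use any Lucene API; the class name PunctuationTokens and the methods tokenize and containsPhrase are hypothetical names chosen for illustration, and the per-character handling is an assumption (it treats every non-letter, non-digit, non-whitespace char as a one-character token and ignores surrogate pairs).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: plain Java, not a Lucene Analyzer.
final class PunctuationTokens {

    // Split text into word tokens plus one token per punctuation/symbol char.
    // Whitespace only separates tokens and is never emitted.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);          // still inside a word
                continue;
            }
            if (word.length() > 0) {     // a word just ended: emit it
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c)); // symbol as its own token
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString()); // flush the trailing word
        }
        return tokens;
    }

    // True if the phrase tokens occur consecutively, in order, in the document.
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("foo, bar costs $5");
        // "foo" immediately followed by "," matches as a phrase.
        System.out.println(containsPhrase(doc, tokenize("foo,")));
        // "foo" immediately followed by "bar" does not: the comma sits between.
        System.out.println(containsPhrase(doc, tokenize("foo bar")));
    }
}
```

With tokens shaped like this, a phrase of "foo" followed by "," can match while a search for "foo" alone still finds the document. In Lucene itself the same idea would presumably live in a custom tokenizer or in extended StandardAnalyzer lexical rules, as suggested in the thread; that part is not shown here.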